Combining Shinkai tools : from PowerPoint to audio
Introduction
In this tutorial, you will learn to combine Shinkai tools to create an AI tool that extracts the text content of a .pptx presentation, writes a lesson text based on that content, and generates an audio file of this lesson. This tool is available in the Shinkai AI Store.
You will learn how to :
- build and add features both using the Shinkai AI assistance and manually
- combine Shinkai tools efficiently (optional features, customizability, config validation, error handling, design decisions)
- implement Optical Character Recognition
- implement text to speech
- use the created tool
This tutorial is a step-by-step guide to implementing the full tool. You can find the complete code below for reference, but we will go over its elements one by one. Usage examples are in the last section of this tutorial (Part 5 : Using the tool).
Now let’s see how to recreate this tool and learn its features and implementation details.
Prerequisites
To follow this tutorial, you will need :
- the latest version of Shinkai Desktop installed
- to install Tesseract for OCR
- to install the ElevenLabs text-to-speech tool from the Shinkai AI Store and configure it
- an ElevenLabs API key
Part 0 : Trying to build the full tool in one go with the Shinkai AI-assisted tool creation UI
Shinkai offers an effortless tool building experience thanks to its AI-assisted tool creation UI, where even library dependencies and tool metadata are handled automatically.
You could try to build a working prototype of the full tool using one detailed prompt and a capable LLM.
In the tool creation UI, select a capable LLM (e.g. gpt_4o, shinkai_free_trial), select Python, activate the 2 tools “shinkai_llm_prompt_processor” and “eleven_labs_text_to_speech”, write a prompt that describes the tool well, and execute it.
For a good result your prompt should be detailed and clearly describe :
- the goal of the tool to create and its steps
- how each of the selected tools should be used
- what you would want in configuration versus inputs
- which features should be optional
- how to handle errors
Below is an example of a prompt to generate a full prototype of our PowerPoint to audio lesson tool. It uses tags to make things clear for the LLM. At the very least, such a prompt will create a good code flow for the intended tool, from which you can debug, edit and improve.
Alternatively, you can build the tool progressively, step-by-step, by first building the content extraction part, then adding the lesson text generation, and finally the audio generation. Each step can be done with AI assistance and/or manually.
Below, you can study a step-by-step implementation of the tool.
Part 1 : Extracting the text content from a .pptx file
1.0 Using Shinkai AI assisted tool creation UI
You can try to build the content extraction feature first using the AI assistance, and then later on add the other features.
To do so, do not select any tool, as this feature does not rely on any, and use a good prompt. Because the prompt covers just one feature and is therefore short, you can make it very thorough and add details on how to build the tool without risking overwhelming the LLM. Here is an example prompt to create a tool that extracts the text content from a .pptx file :
At the very least, such a prompt should create a good code flow for the intended feature, from which you can debug, edit and improve.
Similar prompts were actually used to build the full code above, with step-by-step improvements through prompting and a little manual coding.
Below you’ll find a full description of how to code the content extraction feature.
1.1 Defining the text extraction process
Import what will be needed :
Define the configuration for the Tesseract executable path and the input for .pptx file path :
Create the output class for the content, error messages and status :
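As an illustration of the configuration, input and output definitions described above, the three classes could look like the sketch below. The field names (`tesseract_path`, `pptx_path`, `content`, `error`, `status`) are assumptions chosen for this example and may differ from those of the published tool.

```python
from typing import Optional


class CONFIG:
    # Path to the Tesseract executable, set once in the tool configuration
    tesseract_path: str = ""


class INPUTS:
    # Local path or URL of the .pptx file, provided on every call
    pptx_path: str = ""


class OUTPUT:
    # Extracted text, an optional error message, and a short status string
    content: str = ""
    error: Optional[str] = None
    status: str = "pending"
```

Keeping the Tesseract path in the configuration and the file path in the inputs reflects how often each value changes: the first is set once per installation, the second on every run.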
Define 3 functions to extract text from text blocks, tables and charts :
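A minimal sketch of these three functions is given below. It assumes the presentation is read with the python-pptx library (the source does not name the library) and relies only on attributes that library exposes on shapes, such as `has_text_frame`, `table.rows` and `chart.plots`. The function names and the output layout are hypothetical.

```python
def extract_text_block(shape) -> str:
    """Return the non-empty paragraphs of a shape that has a text frame."""
    if not getattr(shape, "has_text_frame", False):
        return ""
    lines = [paragraph.text.strip() for paragraph in shape.text_frame.paragraphs]
    return "\n".join(line for line in lines if line)


def extract_table(shape) -> str:
    """Return one line per table row, with cells separated by ' | '."""
    if not getattr(shape, "has_table", False):
        return ""
    rows = []
    for row in shape.table.rows:
        rows.append(" | ".join(cell.text.strip() for cell in row.cells))
    return "\n".join(rows)


def extract_chart(shape) -> str:
    """Return the chart title, categories and series values as text."""
    if not getattr(shape, "has_chart", False):
        return ""
    chart = shape.chart
    lines = []
    if chart.has_title:
        lines.append("Chart title: " + chart.chart_title.text_frame.text)
    for plot in chart.plots:
        lines.append("Categories: " + ", ".join(str(c) for c in plot.categories))
        for series in plot.series:
            values = ", ".join(str(v) for v in series.values)
            lines.append(f"{series.name}: {values}")
    return "\n".join(lines)
```

Each function returns an empty string when the shape is not of the expected kind, so a caller can try all three on any shape and keep whatever comes back non-empty.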
Create a function to extract the content from shapes using the functions defined above, plus use Tesseract OCR for picture shapes :
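The OCR part of this step could be sketched as follows, assuming the `pytesseract` and `Pillow` packages are used to call Tesseract (the source only states that Tesseract is required). The helper names are hypothetical, and the constant `13` is the numeric value of `MSO_SHAPE_TYPE.PICTURE` in python-pptx. The OCR backend is passed in as a callable so the logic can be tried without Tesseract installed.

```python
import io

PICTURE = 13  # numeric value of pptx.enum.shapes.MSO_SHAPE_TYPE.PICTURE


def ocr_image_bytes(blob: bytes, tesseract_path: str = "") -> str:
    """Run Tesseract on raw image bytes and return the recognized text."""
    # Imported lazily so the module still loads when OCR packages are missing
    import pytesseract
    from PIL import Image

    if tesseract_path:
        # Point pytesseract at the executable set in the tool configuration
        pytesseract.pytesseract.tesseract_cmd = tesseract_path
    with Image.open(io.BytesIO(blob)) as image:
        return pytesseract.image_to_string(image).strip()


def extract_picture_text(shape, ocr=ocr_image_bytes) -> str:
    """Return OCR text for picture shapes, or '' for any other shape."""
    if getattr(shape, "shape_type", None) != PICTURE:
        return ""
    try:
        return ocr(shape.image.blob)
    except Exception as exc:  # an OCR failure should not abort the whole extraction
        return f"[OCR failed: {exc}]"
```

To use a configured executable path, the caller can bind it first, for example with `functools.partial(ocr_image_bytes, tesseract_path=...)`, and pass the result as `ocr`.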
Define a function to read the presentation from either a URL or a local file path. This step also checks that the file is readable :
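One possible shape for this step, using only the standard library, is sketched below. The function name `load_pptx_bytes` is hypothetical. The validity check relies on the fact that a .pptx file is a ZIP archive containing `ppt/presentation.xml`.

```python
import io
import urllib.request
import zipfile


def load_pptx_bytes(source: str) -> bytes:
    """Read a .pptx from a URL or a local path and check that it looks valid."""
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source, timeout=30) as response:
            data = response.read()
    else:
        with open(source, "rb") as handle:
            data = handle.read()

    # A .pptx file is a ZIP archive that contains ppt/presentation.xml
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        raise ValueError(f"Not a valid .pptx (not a ZIP archive): {source}")
    with zipfile.ZipFile(buffer) as archive:
        if "ppt/presentation.xml" not in archive.namelist():
            raise ValueError(f"Not a valid .pptx (no presentation part): {source}")
    return data
```

If python-pptx is used for parsing, the returned bytes can then be opened with `Presentation(io.BytesIO(data))`, since that constructor accepts a file-like object as well as a path.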
1.2 Retrieving the content shape by shape and slide by slide
Create a function that applies the content extraction slide by slide and shape by shape :
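The traversal itself can be quite small. The sketch below assumes a presentation object that exposes `slides` and, on each slide, `shapes` (as python-pptx does), and takes the per-shape extractor as a parameter. The function name and the "--- Slide N ---" header format are illustrative choices, not necessarily those of the published tool.

```python
from typing import Callable


def extract_presentation(presentation, extract_shape: Callable[[object], str]) -> str:
    """Walk every slide and shape, grouping the extracted text per slide."""
    sections = []
    for number, slide in enumerate(presentation.slides, start=1):
        parts = []
        for shape in slide.shapes:
            text = extract_shape(shape)
            if text:
                parts.append(text)
        # Slides without any extracted text are skipped entirely
        if parts:
            sections.append(f"--- Slide {number} ---\n" + "\n".join(parts))
    return "\n\n".join(sections)
```

Labelling each section with its slide number is what later allows questions about a specific slide, and a presentation with no text at all yields an empty string that the caller can detect.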
1.3 Validating the configuration
Implement a validation function to stop the tool and log errors if there are issues with the Tesseract OCR implementation.
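A validation step along these lines might look like the sketch below. The function name and the error wording are hypothetical. It only checks that the configured path points to an executable file, and returns the problems as a list so the caller can log them and stop.

```python
import os
from typing import List


def validate_tesseract_path(tesseract_path: str) -> List[str]:
    """Return a list of problems found with the configured Tesseract path."""
    errors = []
    if not tesseract_path:
        errors.append("Tesseract executable path is not set in the configuration.")
    elif not os.path.isfile(tesseract_path):
        errors.append(f"Tesseract executable not found: {tesseract_path}")
    elif not os.access(tesseract_path, os.X_OK):
        errors.append(f"Tesseract path is not executable: {tesseract_path}")
    return errors
```

An empty list means the configuration passed the checks. Any other result can be joined into the error message of the tool output before returning early.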
1.4 Run function to execute all the processes
Define a run function that uses all the functions defined above. At the end, add a step to check whether the extracted content is empty. This check is useful because the tool will later use the extracted content to generate a lesson text. If there is no content, the tool stops and informs the user, saving compute and time.
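The empty-content check at the end of the run function could be sketched as follows. The `OUTPUT` fields and the helper name are assumptions made for this example. In the real tool, this logic would sit inside `run` after the extraction call.

```python
from typing import Optional


class OUTPUT:
    content: str = ""
    error: Optional[str] = None
    status: str = "pending"


def check_extracted_content(content: str) -> OUTPUT:
    """Build the output, flagging presentations that yielded no text."""
    output = OUTPUT()
    if not content.strip():
        # Nothing to build a lesson from: report it so run() can return early
        output.error = "No text content could be extracted from the presentation."
        output.status = "error"
        return output
    output.content = content
    output.status = "success"
    return output
```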
Part 2 : Adding an LLM prompt processor to generate a lesson text
2.0 Using Shinkai AI assisted tool creation UI
Now you can use the AI assisted tool creation to add a step which generates the lesson text, using the slides content extracted in the first step.
To do so, activate the tool ‘shinkai_llm_prompt_processor’, and use a prompt similar to this one :
At the very least, such a prompt should add a good code flow for the intended additional feature, from which you can debug, edit and improve.
Similar prompts were actually used to add this next feature, with step-by-step improvements through prompting and a little manual coding.
Below you’ll find a full description of the code to add a lesson text generation feature.
2.1 Setting up the lesson text generation feature
Import the ‘shinkai_llm_prompt_processor’ tool. Also add ‘re’ to the imports. It will be used to clean the generated text.
Add an input for additional instructions to generate the lesson text, so that the user can customize it. Set default to ‘none’.
Add an output for the generated lesson :
2.2 Using an elaborate prompt to generate optimal lesson text
In the run function, add a step to define a detailed prompt for the text generation. Give some context describing the type of content the LLM will use and its specificities. Include formatting instructions. Include the optional additional instructions coming from the user. Organise it well and use tags to make things clear for the LLM :
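To illustrate the structure described above, here is a minimal sketch of a tagged prompt builder. The tag names, the wording of each section and the function name are assumptions made for this example, not the prompt used by the published tool. The user's additional instructions are included only when they differ from the default ‘none’.

```python
def build_lesson_prompt(slides_content: str, additional_instructions: str = "none") -> str:
    """Assemble a tagged prompt asking the LLM to write a lesson."""
    sections = [
        "<context>\n"
        "The content below was extracted slide by slide from a presentation. "
        "It may include table rows, chart data and OCR text with small errors.\n"
        "</context>",
        "<task>\n"
        "Write a clear lesson that explains this presentation to a learner.\n"
        "</task>",
        "<format>\n"
        "Plain text only, in full sentences, suitable for reading aloud. "
        "No markdown, bullet symbols, or other special characters.\n"
        "</format>",
    ]
    extra = additional_instructions.strip()
    if extra and extra.lower() != "none":
        sections.append(f"<additional_instructions>\n{extra}\n</additional_instructions>")
    # The extracted content goes last, clearly delimited from the instructions
    sections.append(f"<slides_content>\n{slides_content}\n</slides_content>")
    return "\n\n".join(sections)
```

Separating context, task, format and content with tags makes it less likely that the LLM confuses slide text with instructions.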
2.3 Calling the LLM prompt processor tool
Just under, add a step to call the LLM prompt processor tool, using the prompt defined above.
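The call could be wrapped as in the sketch below. It assumes, without confirmation from this page, that the LLM tool is an async callable taking a dictionary with `prompt` and `format` keys and returning a dictionary with a `message` key. Inside Shinkai, the callable passed as `llm_tool` would be the imported ‘shinkai_llm_prompt_processor’. Here a stand-in named `fake_llm` is used so the sketch runs on its own.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

LlmTool = Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]]


async def generate_lesson(prompt: str, llm_tool: LlmTool) -> str:
    """Send the prompt to the LLM tool and return the generated text."""
    # ASSUMPTION: the tool takes {"prompt", "format"} and returns {"message"}
    response = await llm_tool({"prompt": prompt, "format": "text"})
    message = response.get("message") or ""
    if not message.strip():
        raise ValueError("The LLM returned an empty lesson text.")
    return message


async def fake_llm(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in used only to try the sketch outside Shinkai
    return {"message": "Lesson about: " + payload["prompt"]}


if __name__ == "__main__":
    print(asyncio.run(generate_lesson("photosynthesis", fake_llm)))
```

Raising on an empty answer lets the run function report a clear error instead of passing an empty text on to the next steps.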
2.4 Cleaning the text
Along with the previously defined functions at the top of the code, add a function to remove special characters from the generated lesson text, in case the LLM includes some despite the formatting instructions in our prompt :
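A cleaning function of this kind might look like the following sketch. The exact set of characters to strip is an assumption. Here it targets common markdown symbols and list markers, which tend to be read aloud awkwardly by text-to-speech engines.

```python
import re


def clean_lesson_text(text: str) -> str:
    """Strip markdown-style symbols and tidy whitespace for text to speech."""
    # Remove characters commonly used for markdown emphasis, headings and code
    text = re.sub(r"[*_#`~]", "", text)
    # Drop list markers such as "- " or "1. " at the start of a line
    text = re.sub(r"(?m)^[ \t]*(?:[-+•]|\d+\.)[ \t]+", "", text)
    # Collapse runs of spaces or tabs, then limit consecutive blank lines to one
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

For example, `clean_lesson_text("## Title\n\n\n\n- **First** point\n2. Second   point")` returns `"Title\n\nFirst point\nSecond point"`.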
Add a step in the run function to use it :
Edit the outputs of the run function to also include the cleaned generated text.
Part 3 : Adding an optional text to speech feature to create an audio file of the lesson
3.0 Using Shinkai AI assisted tool creation UI
Now you can use the AI assisted tool creation to add a final optional step which generates an audio file of the cleaned lesson text generated by the 2nd feature of the tool.
To do so, activate the tool ‘eleven_labs_text_to_speech’, and use a prompt similar to this one :
At the very least, such a prompt should add a good code flow for the intended last feature, from which you can debug, edit and improve.
Similar prompts were actually used to add this next feature, with step-by-step improvements through prompting and a little manual coding.
Below you’ll find a full description of the code to add the audio file generation.
3.1 Setting up the audio file generation feature
Near the start of the code, add the ‘eleven_labs_text_to_speech’ tool to the imports, along with ‘shutil’ (used for file operations).
Edit config and output classes to also include the option to generate the audio and the optional audio file.
Define a function to get the name of the .pptx file. It will be used to save the audio file with the same name.
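Such a helper can be written with the standard library alone. The sketch below uses a hypothetical function name and a fallback name of `presentation`, and handles both local paths and URLs, since the tool accepts either.

```python
import os
from urllib.parse import unquote, urlparse


def get_pptx_name(source: str) -> str:
    """Return the presentation's base name without its extension."""
    if source.startswith(("http://", "https://")):
        # Keep only the path part of the URL, ignoring query and fragment
        source = unquote(urlparse(source).path)
    # Normalize Windows separators so basename works on any platform
    base = os.path.basename(source.replace("\\", "/"))
    name, _extension = os.path.splitext(base)
    return name or "presentation"
```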
Add a step to the validate_config function to also check the configuration of the optional audio generation.
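The extra check can be very small. The sketch below assumes the option is stored as a string named `generate_audio` (a hypothetical name) that must be ‘yes’ or ‘no’, and returns a list of error messages in the same spirit as the other validation checks.

```python
from typing import List


def validate_audio_option(generate_audio: str) -> List[str]:
    """Check that the audio option is 'yes' or 'no' (case-insensitive)."""
    value = generate_audio.strip().lower()
    if value not in ("yes", "no"):
        return [f"generate_audio must be 'yes' or 'no', got: {generate_audio!r}"]
    return []
```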
3.2 Calling the text to speech tool
Add a step to the run function to use the ‘eleven_labs_text_to_speech’ tool. This step is optional according to the configuration. Add a step to change the name of the audio file generated by the text-to-speech tool : make it more user friendly by simply using the name of the original .pptx file. Also include an error message if the audio file generation failed.
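The renaming part of this step could be sketched as below. How the text-to-speech tool reports the path of the file it created is not shown on this page, so `generated_path` simply stands for whatever path it returns. The function name and the `.mp3` fallback extension are assumptions.

```python
import os
import shutil


def rename_audio_file(generated_path: str, pptx_name: str) -> str:
    """Rename the generated audio file after the presentation, in place."""
    if not generated_path or not os.path.isfile(generated_path):
        # Surface a clear error when the text-to-speech step produced no file
        raise FileNotFoundError(f"Audio file was not generated: {generated_path}")
    folder = os.path.dirname(generated_path)
    extension = os.path.splitext(generated_path)[1] or ".mp3"
    target = os.path.join(folder, pptx_name + extension)
    shutil.move(generated_path, target)
    return target
```

The run function can catch the `FileNotFoundError` and copy its message into the error field of the output, so that a failed audio generation does not hide the extracted content and lesson text.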
Edit the outputs of the run function to also include the generated audio file.
Please note that you could also modify the tool to use another text-to-speech provider, even a local one.
Part 4 : Improving the metadata of the tool
Shinkai generates the tool metadata automatically, but you can improve it.
Good tool metadata should include :
- an explicit tool title
- a thorough description (features, options, requirements, extra information)
- explicit descriptions for configurations and inputs
- adequate usable keywords to trigger the tool
Go to the metadata section and improve the items above. Below is an example of good metadata for the tool.
Title :
Description :
Metadata JSON :
Now the tool should be complete. Save it.
Below you’ll find usage examples.
Part 5 : Using the tool ‘PPTX Content Extractor With OCR And Audio Lesson Generator’
5.1 Installing extra components and setting up configurations
Install Tesseract for OCR, and set its executable path in the configuration of the ‘PPTX Content Extractor With OCR And Audio Lesson Generator’ tool.
Install the ‘eleven_labs_text_to_speech’ tool from the Shinkai AI Store. Get an ElevenLabs API key with some credits. Go to the configuration tab of this ElevenLabs Shinkai tool and set your API key and pick a voice.
Set audio generation to ‘yes’ or ‘no’ in the configuration of the ‘PPTX Content Extractor With OCR And Audio Lesson Generator’.
5.2 Usage examples
To generate an audio lesson of a .pptx file, set audio generation to ‘yes’ in the configuration, and simply mention the filename in your prompt.
To simply interact with the content of the .pptx file through prompts, set audio generation to ‘no’, include the file path in your prompt, and add instructions.
Because the content is extracted slide by slide, you can also ask about specific slides.