
OpenAI Integration

Welcome to this tutorial series in robotics powered by the BOW SDK. This example was originally showcased in a BOW webinar "Bring your robot dog to life with the ChatGPT API and BOW SDK" which you can watch below.

Recommended Robots

We highly recommend a quadruped robot for this tutorial, such as:

  • InMotion Robotics - Lite3
  • InMotion Robotics - X30
  • Unitree - Go2

Prerequisites

Before trying these tutorials, make sure you have followed the instructions from the dependencies step to set up the development environment for your chosen programming language.

These tutorials also assume you have installed the BOW Hub, available for download from https://bow.software, and that you have a Standard Subscription (or above) or are using the 30-day free trial, which is required to simulate robots.

This tutorial requires a paid OpenAI account with an API key set up and exported as an environment variable, e.g. "OPENAI_API_KEY=sk-xxxxx". See the OpenAI documentation for more details on setting up the OpenAI API.
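If you want to confirm the key is available before launching the tutorial, a minimal check like the sketch below works with the official openai Python package (this script is not part of the tutorial itself):

import os

from openai import OpenAI

# The OpenAI client reads OPENAI_API_KEY from the environment by default.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set - export it before running the tutorial")

client = OpenAI()  # equivalent to OpenAI(api_key=api_key)
print("OpenAI client initialised with key ending in", api_key[-4:])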

What is the aim of this tutorial?

Since OpenAI rocketed into the mainstream, people have been asking how to embody this powerful technology so that it is not limited to responding with text and images and can instead interact with the real world. This tutorial aims to give a brief example of how combining the BOW SDK with the OpenAI API can make this a reality without unnecessary complexity.

For this tutorial we are going to focus on the tasks of object detection and searching for objects. Although OpenAI provides methods of analysing images, this process is (currently) not fast enough to be used for real-world control, so object detection is instead performed locally by a computer vision model, YOLOv8. This model comes pre-trained on the COCO (Common Objects in Context) dataset and can therefore detect any object whose label appears in that dataset. The results of this object detection can then be communicated to the OpenAI API.
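As a rough illustration of the local detection step (not the tutorial's exact code), the sketch below uses the ultralytics package with the COCO-pretrained yolov8n weights; the helper returns the label, confidence and bounding box for each detection:

import cv2
from ultralytics import YOLO

# Load a COCO-pretrained YOLOv8 model (weights are downloaded on first use).
model = YOLO("yolov8n.pt")

def detect_objects(frame):
    """Run YOLOv8 on a BGR image and return (label, confidence, box) tuples."""
    results = model(frame, verbose=False)[0]
    detections = []
    for box in results.boxes:
        label = results.names[int(box.cls)]
        confidence = float(box.conf)
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        detections.append((label, confidence, (x1, y1, x2, y2)))
    return detections

# Example: detect and outline objects in a single test image.
image = cv2.imread("example.jpg")  # any test image will do
if image is not None:
    for label, confidence, (x1, y1, x2, y2) in detect_objects(image):
        print(f"{label}: {confidence:.2f} at ({x1}, {y1}, {x2}, {y2})")
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)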

We also have to provide functions which the API can call to control the robot. In this case it is a very basic search function which rotates the robot on the spot, changing its view of the world until the target object becomes visible.

These functions will all be contained within a simple-to-use GUI which not only displays the robot's viewpoint, but also allows you to communicate with the OpenAI-powered robot and read its responses.

Sense

  • Connect to a robot by calling QuickConnect
  • Open a stream of communication with the OpenAI Assistant
  • Get the images sampled by the robot by calling GetModality("vision")
  • Perform object detection locally using YOLOv8, a cutting-edge computer vision model
  • Overlay object detections onto the images and display them within the GUI
  • Communicate the state of the robot to the OpenAI Assistant
  • Provide information to the assistant in the form of user messages (a rough sketch of this sense loop follows the list)
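Putting the sense stage together, a rough sketch follows. The BOW-specific calls are represented here by placeholder helpers because the exact call signatures depend on your installed SDK version; the real implementations live in the tutorial's gui.py and robot_controller.py, and detect_objects is a detector such as the YOLOv8 helper sketched above.

def connect_robot():
    """Placeholder for the BOW SDK QuickConnect call - should return a robot handle."""
    raise NotImplementedError("replace with the BOW SDK QuickConnect call")

def get_camera_frame(robot):
    """Placeholder for GetModality("vision") - should return one camera image."""
    raise NotImplementedError("replace with the BOW SDK vision sampling call")

def sense_loop(detect_objects):
    """Sample images, run local detection, and summarise the result for the assistant."""
    robot = connect_robot()
    while True:
        frame = get_camera_frame(robot)
        if frame is None:
            continue
        detections = detect_objects(frame)
        visible = sorted({label for label, _, _ in detections})
        # This summary is what gets passed to the OpenAI Assistant as a user message.
        print("Currently visible objects:", ", ".join(visible) or "none")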

Decide

In this case, decisions are all made by the OpenAI Assistant. Based on the information we provide, the assistant can choose to take one of the following actions, for which illustrative function definitions are sketched after the list:

  • Search for an object in the COCO dataset
  • Stop
  • Request a list of currently visible objects
  • Request a list of currently running functions
  • Speak to the user
  • Ask the user for clarification/further information
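These choices map naturally onto OpenAI function-calling tool definitions. The sketch below shows what a few of them could look like; the names and parameters are illustrative, and the tutorial's actual definitions live in the Assistant_Functions folder of the repository.

# Illustrative tool definitions the assistant could be given.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_for_object",
            "description": "Rotate the robot on the spot until the target COCO object is visible.",
            "parameters": {
                "type": "object",
                "properties": {
                    "target": {"type": "string", "description": "A COCO class name, e.g. 'cat'"}
                },
                "required": ["target"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "stop",
            "description": "Stop all robot motion immediately.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_visible_objects",
            "description": "Return the list of objects currently detected by the robot's camera.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]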

Act

  • Assistant triggers one of the above actions
  • Assistant communicates its chosen action to the user in the form of a message
  • Message is passed to the robot to be spoken
  • If a search is triggered, the robot will begin to rotate on the spot using its method of locomotion (a sketch of such a search function follows the list).
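A search function along these lines might look like the sketch below. The get_visible_objects and rotate_step arguments are placeholders for the tutorial's robot_controller.py functions, which issue the actual BOW motor commands.

import time

def search_for_object(robot, target, get_visible_objects, rotate_step, timeout_s=60.0):
    """Rotate on the spot until `target` is detected or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if target in get_visible_objects(robot):
            rotate_step(robot, 0.0)   # stop turning
            return f"Found a {target}."
        rotate_step(robot, 0.3)       # keep turning slowly on the spot
        time.sleep(0.1)
    rotate_step(robot, 0.0)
    return f"Could not find a {target} before timing out."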

Preparation

Creating an OpenAI Assistant

Before running the demo it is first necessary to create an OpenAI Assistant. This is an instance of an OpenAI model which not only responds to queries but can also call functions from within your application. To understand more about OpenAI Assistants, read their overview.

To create an assistant, navigate to the "Assistant_Functions" folder within the "OpenAI_Integration" directory, which is part of the SDK-Tutorials repository.

cd 'SDK-Tutorials/OpenAI_Integration/Assistant_Functions'

Execute the "openai_create_assistant.py" script:

python openai_create_assistant.py

The output of this script contains your assistant ID, a string in the form "asst_abc123". Take note of this string, as it is the reference to your created assistant. It is also possible to view, create and delete assistants in your browser by logging into your OpenAI account.
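For reference, the creation script does something broadly along the lines of the sketch below, using the Assistants API from the official openai package. The name, instructions, model and tool list here are illustrative rather than the tutorial's exact values.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Register an assistant together with the function tools it is allowed to call.
assistant = client.beta.assistants.create(
    name="BOW robot assistant",
    instructions=(
        "You control a quadruped robot. Use the provided functions to search "
        "for COCO objects and report what you are doing back to the user."
    ),
    model="gpt-4o",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "search_for_object",
                "description": "Rotate on the spot until the target object is visible.",
                "parameters": {
                    "type": "object",
                    "properties": {"target": {"type": "string"}},
                    "required": ["target"],
                },
            },
        }
    ],
)

print("Assistant created with ID:", assistant.id)  # e.g. asst_abc123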

Replace the assistant ID currently declared on line 9 of openai_brain.py with your new assistant ID:

from openai import AssistantEventHandler
 
assistant_id = "asst_BF4jtnh3Mt0lAA2p4Uyvzm9f"
 
class EventHandler(AssistantEventHandler):

to

from openai import AssistantEventHandler
 
assistant_id = "asst_abc123"
 
class EventHandler(AssistantEventHandler):

Preparing your simulated world

If you are running this demonstration in the simulator, then we recommend you add some objects from the COCO Dataset to the world for your robot to detect. Some which are readily available in webots are:

  • humans/pedestrian/Pedestrian
  • objects/animals/Cat
  • objects/animals/Sheep
  • objects/traffic/StopSign

Running and interacting with the tutorial

Simply run the gui.py script in the "OpenAI_Integration" directory to begin the tutorial.

python gui.py

A GUI will appear which, on the left, shows images from the robot's camera with object detections overlaid.

To communicate with your robot assistant, type messages in the white box on the right side of the GUI and hit enter or press the send button. You will see a response from the ChatGPT assistant in the black output box; the assistant will not only answer your questions but also command your robot to perform its available actions and communicate these actions to you.

If your robot has speech capabilities, these responses will also be vocalised by your robot.

OpenAI >< BOW GUI Window

Code Structure

The code has three key Python files:

gui.py

This file is the execution point for this project and contains:

  • The gui description
  • A small class for storing detected objects and their details
  • The main function, which launches the GUI, connects to the robot, begins sampling images from the robot and passes them into the local YOLO model. The output of this model is then parsed to store the details of the detected objects, which are then drawn onto the image and displayed in the GUI.

robot_controller.py

This file contains the process for connecting to a robot and the local functions for controlling the robot which can be called by the assistant. These are very basic implementations, meant only as a starting point.

openai_brain.py

This file contains the implementation used to communicate with the assistant and to handle its function calls.
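To give a flavour of the approach, the sketch below streams a reply from an assistant using the openai package's AssistantEventHandler. It is deliberately simplified: the assistant ID is an example, and the real openai_brain.py additionally intercepts the assistant's tool calls and submits their outputs back to the run.

from openai import OpenAI, AssistantEventHandler

client = OpenAI()

class PrintingEventHandler(AssistantEventHandler):
    """Minimal handler that prints the assistant's streamed reply."""
    def on_text_delta(self, delta, snapshot):
        print(delta.value, end="", flush=True)

def ask_assistant(assistant_id, user_message):
    # One thread per conversation; messages accumulate on the thread.
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=user_message
    )
    with client.beta.threads.runs.stream(
        thread_id=thread.id,
        assistant_id=assistant_id,
        event_handler=PrintingEventHandler(),
    ) as stream:
        stream.until_done()

ask_assistant("asst_abc123", "What objects can you see right now?")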

Code Breakdown

Coming Soon!
