Object Recognition
Identify objects in the visual scene using a neural net.
By the end of this tutorial you will be able to:
- Provide robots with knowledge about the world around them
- Exploit image statistics to identify and localise visual objects
- Define actions that depend on the identity of stimuli in the visual scene
Everybody seems to be talking about Artificial Intelligence (A.I.) at the moment. In recent years this particular branch of scientific enquiry and technological innovation has risen to the fore of public consciousness like no other.
In the process, what we now mean when we talk about A.I. has also incorporated several related but hitherto distinct disciplines, including machine learning, data science, and even what we used to call statistics, to the point that it’s not always clear exactly what we’re all so excited about. For better or worse, there’s no escaping A.I.
Yet A.I. – the pursuit of human thinking in machines – has been going for a considerable time, and is based on foundational research that predates conventional computing. Since Alan Turing asked "Can machines think?" in 1950 and Frank Rosenblatt described the first artificial neural network in 1958, A.I. has fallen in and out of fashion numerous times. Optimism about the potential for technologies to support human-like cognition has come in waves, and is certainly riding high at the moment. But the two biggest waves before this have ended in prolonged periods of pessimism (and reduced research funding) known as the ‘A.I. winters’, the first in the mid-to-late 1970s and the second stretching from the late 1980s into the 1990s. A key tension that has helped keep this boom-and-bust cycle going for so long is a fundamental debate about the importance of symbols.
From one perspective, symbolic A.I., the main route to recreating human thinking has been to build systems that represent knowledge about the world in terms of discrete symbols and the relationships between them. From another perspective, sub-symbolic A.I., the focus has instead been to build systems that represent knowledge implicitly, in terms of patterns in data, to which symbols may subsequently be assigned. There is no doubt from either perspective that people communicate by exchanging the discrete symbols of our shared languages (words), and that these symbols correspond to real statistical structure in data. But whether or not symbols need to be explicitly maintained is still up for grabs.
Navel-gazing aside, the ability to refer to discrete objects in the world by name when we want to communicate with an embodied A.I. system (a robot) is very useful when it comes to intuitively defining a task for a robot to carry out. So in this tutorial, we will be using a (third-party) A.I. system to assign meaningful symbols to objects in our robots’ visual scenes.
About the Application
The purpose of the application you will develop here is to perform object detection using the Ultralytics YOLOv8 model on images obtained from your robot's cameras. The structure of the application is as follows:
Before we get stuck in:
If you are just browsing to get a sense of what's possible, take a look at the code online.
Running the Application
Navigate to the Applications/ObjectRecognition/Python folder in the SDK Tutorials repository:
Execute the example program:
Investigation
The purpose of this tutorial is to demonstrate how to use the BOW vision channel to perform object detection on the images from a robot's cameras. It therefore demonstrates how to obtain images, pass them to an external tool, and display and annotate the images using OpenCV.
In this case the object detection is performed by the Ultralytics implementation of the YOLOv8 model. This model comes pre-trained on the COCO (Common Objects in Context) dataset and as such is capable of detecting any of the object classes in that dataset's label set. To expand on this you could try using a different detection algorithm, such as those built into OpenCV, or another external tool.
If you are planning to run this tutorial in the simulator, then we recommend you add some objects from the COCO dataset to the world for your robot to detect. Some that are readily available in Webots are:
- humans/pedestrian/Pedestrian
- objects/animals/Cat
- objects/animals/Sheep
- objects/traffic/StopSign
Similarly, if you are running this tutorial on a physical robot then make sure you have some objects to detect for testing. Potted plants, cups, bottles, mobile phones and of course yourself are all great for testing.
When you run the application an OpenCV image window will appear containing images from the first camera supplied by the robot. When an object is detected by the YOLO model, a green box will appear around it alongside a label giving the object's classification. Bring new objects into view and see how well the off-the-shelf YOLO detection works.
Code Breakdown
Imports
We start by importing all of the required Python modules, including the OpenCV module (cv2) for displaying images and labelling detected objects. In addition, we import the Ultralytics YOLO implementation to be used for visual object recognition:
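As a rough sketch, the imports might look like this; the BOW client module name (bow_api) is an assumption and should match whatever the tutorial code actually imports:

```python
import cv2                    # OpenCV (cv2) for displaying images and labelling detections
import numpy as np            # convenient array handling for the image data

from ultralytics import YOLO  # Ultralytics YOLOv8 implementation

# The BOW client import is an assumption based on the prose above;
# use whichever module the SDK Tutorials repository actually imports.
import bow_api as bow
```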
Connecting to our robot
Next we use quick_connect to connect to the robot. This function will automatically connect to the robot selected in your BOW Hub at runtime, exiting if the connection is unsuccessful:
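A minimal sketch of the connection step, assuming quick_connect returns a robot object and an error/status value (the exact signature and keyword arguments are assumptions, so check the SDK reference):

```python
# Connect to the robot currently selected in the BOW Hub.
# The (robot, error) return value and the keyword arguments are assumptions.
robot, error = bow.quick_connect(app_name="ObjectRecognition", channels=["vision"])
if not error.Success:
    print("Failed to connect to robot:", error)
    raise SystemExit(1)
```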
Then we load a YOLO model from the imported module. In this case we use 'yolov8n.pt', a model pretrained on the COCO dataset. On first run this may take a short while to download, but it will be stored locally for subsequent use:
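Loading the pretrained model with the Ultralytics API is a one-liner:

```python
# Load the 'nano' YOLOv8 model pretrained on COCO.
# The weights are downloaded on first use and cached locally afterwards.
model = YOLO("yolov8n.pt")
```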
The whole main loop is wrapped in a try/except block, which destroys the image window and exits the loop by setting the flag to True in the case of a keyboard interrupt (Ctrl-C) or if the program is closed:
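The surrounding control flow might look roughly like this skeleton (the flag name is illustrative):

```python
stop_flag = False
try:
    while not stop_flag:
        # ... main loop body, covered step by step below ...
        pass
except KeyboardInterrupt:
    # Ctrl-C or the program being closed: remove the image window and stop looping.
    cv2.destroyAllWindows()
    stop_flag = True
```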
The main loop
Inside the main loop we call the get function on the vision channel in order to obtain the list of image samples from the robot. We then test for success and that the returned list is not empty; if this test fails we begin the loop again:
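A sketch of this step; the call on the vision channel and its (samples, error) return shape are assumptions taken from the prose:

```python
# (inside the main loop)
# Request the latest image samples from the vision channel.
image_samples, err = robot.vision.get()
if not err.Success or len(image_samples) == 0:
    continue  # no images this iteration, go round the loop again
```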
We obtain the first image sample from the list and test that it contains valid data. In this simple example we always use the first camera in the list; however, you could query the list of image samples in order to choose whichever camera you wanted, for example using the "source" field for the name of the camera, or using the "Transform.Position" field to select an image based on its location on the robot:
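For example (the Data field holding the pixel data is an assumed name):

```python
# (inside the main loop)
# Use the first camera in the list; you could instead select a camera by name
# (the sample's "source" field) or by its position on the robot (Transform.Position).
img_sample = image_samples[0]
if img_sample.Data is None:   # the name of the field holding the pixel data is an assumption
    continue
```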
The image data is then extracted into the myIm variable and passed into the YOLO model, which returns a list of predictions about the objects in the image in the form of tensors:
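Running the model on the extracted frame uses the standard Ultralytics call, which returns a list of Results objects; how the raw sample converts to a numpy array depends on the SDK's image format, so treat that line as a placeholder:

```python
# (inside the main loop)
# Extract the pixel data as a numpy array that OpenCV and YOLO can work with.
myIm = np.asarray(img_sample.Data)

# Run detection; the model returns one Results object per input image,
# each holding the detected bounding boxes as tensors.
results = model(myIm)
```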
Next we check that there is at least one detected object; if so, we begin iterating through these objects in order to draw them onto the image. The two pieces of information we require in this case are the location of the detection in the frame and the predicted classification of that detection:
The xyxy.numpy() call returns the coordinates of the top-left and bottom-right corners of the object's bounding box as a numpy array, a format that we can easily work with. The classification label is obtained by taking the class value from the detection's data tensor and looking it up in the model.names list, which contains all of the class names associated with this model.
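With the Ultralytics Results API, that iteration might look like this (here the class index is read from box.cls, one common way of obtaining the same value):

```python
# (inside the main loop)
detections = results[0].boxes
if len(detections) > 0:
    for box in detections:
        # Top-left (x1, y1) and bottom-right (x2, y2) corners of the bounding box,
        # converted to plain ints for OpenCV (add .cpu() before .numpy() on a GPU).
        x1, y1, x2, y2 = (int(v) for v in box.xyxy[0].numpy())

        # Human-readable class name for this detection.
        class_name = model.names[int(box.cls[0])]
```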
The bounding box of the object can now be drawn onto the image in our chosen colour using the rectangle function from OpenCV:
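For example:

```python
        # (inside the per-detection loop)
        # Draw the bounding box in green (OpenCV uses BGR colour order), 2 pixels thick.
        colour = (0, 255, 0)
        cv2.rectangle(myIm, (x1, y1), (x2, y2), colour, 2)
```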
Next we specify the location to write the classification label on the image. By default we place it 10 pixels above the top left of the box; however, if we are within 10 pixels of the top of the image, we instead set the location to 20 pixels below the bottom left of the box. We then use this location, alongside our chosen colour, in the OpenCV putText function, which draws the text onto the image at this location.
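A sketch of that placement logic and the putText call:

```python
        # (inside the per-detection loop)
        # Place the label 10 pixels above the top-left corner of the box, unless the
        # box is within 10 pixels of the top of the image, in which case place it
        # 20 pixels below the bottom-left corner instead.
        if y1 < 10:
            label_pos = (x1, y2 + 20)
        else:
            label_pos = (x1, y1 - 10)

        cv2.putText(myIm, class_name, label_pos,
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, colour, 2)
```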
We then use the imshow function to display the resulting image in a window. The camera name obtained from the Source field of the image sample is used as the window title.
Finally, OpenCV requires a small delay following an imshow call in order to render the image. In this case we use the waitKey function to delay for 1 millisecond and capture any key press. We also use this to test for the escape key (27) and break out of the main loop if it has been pressed.
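Together, the display and key-handling steps might look like this (the Source field used for the window title is taken from the prose and is an assumed attribute name):

```python
# (back at the top level of the main loop)
# Show the annotated frame in a window titled with the camera name.
cv2.imshow(img_sample.Source, myIm)

# waitKey gives OpenCV a moment to render and returns any key pressed;
# 27 is the escape key, which we use to break out of the main loop.
if cv2.waitKey(1) == 27:
    break
```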
Exit
The last required actions are to clean everything up following the exit of the program. We ensure the image windows have been closed, disconnect from the robot, and then close our client.
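A sketch of the shutdown sequence; the disconnect and close calls on the BOW client are assumed names:

```python
# Tidy up: remove any OpenCV windows, then release the robot connection.
cv2.destroyAllWindows()
robot.disconnect()              # assumed name for disconnecting from the robot
bow.close_client_interface()    # assumed name for shutting down the client
```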