Object Recognition
Identify objects in the visual scene using a neural net.
By the end of this tutorial you will be able to:
- Provide robots with knowledge about the world around them
- Exploit image statistics to identify and localise visual objects
- Define actions that depend on the identity of stimuli in the visual scene
Everybody seems to be talking about Artificial Intelligence (A.I.) at the moment. In recent years this particular branch of scientific enquiry and technological innovation has risen to the fore of public consciousness like no other.
In the process, what we now mean when we talk about A.I. has also incorporated several related but hitherto distinct disciplines, including machine learning, data science, and even what we used to call statistics, to the point that it’s not always clear exactly what we’re all so excited about. For better or worse, there’s no escaping A.I.
Yet A.I. – the pursuit of human thinking in machines – has been going for a considerable time, and is based on foundational research that predates conventional computing. Since Alan Turing asked "Can machines think?" in 1950 and Frank Rosenblatt described the first artificial neural network in 1958, A.I. has fallen in and out of fashion numerous times. Optimism about the potential for technologies to support human-like cognition has come in waves, and is certainly riding high at the moment. But the two biggest waves before this have ended in prolonged periods of pessimism (and reduced research funding) known as the ‘A.I. winters’, the first in the mid-to-late 1970s and the second stretching from the late 1980s into the 1990s. A key tension that has helped keep this boom-and-bust cycle going for so long is a fundamental debate about the importance of symbols.
From one perspective, symbolic A.I., the main route to recreating human thinking has been to build systems that represent knowledge about the world in terms of discrete symbols and the relationships between them. From another perspective, sub-symbolic A.I., the focus has instead been to build systems that represent knowledge implicitly, in terms of patterns in data, to which symbols may subsequently be assigned. There is no doubt from either perspective that people communicate by exchanging the discrete symbols of our shared languages (words), and that these symbols correspond to real statistical structure in data. But whether or not symbols need to be explicitly maintained is still up for grabs.
Navel-gazing aside, the ability to refer to discrete objects in the world by name when we want to communicate with an embodied A.I. system (a robot) is very useful when it comes to intuitively defining a task for a robot to carry out. So in this tutorial, we will be using a (third-party) A.I. system to assign meaningful symbols to objects in our robots’ visual scenes.
About the Application
The purpose of the application you will develop here is to perform object detection using the Ultralytics YOLOv8 model on images obtained from your robot's cameras. The structure of the application is as follows:
Before we get stuck in:
If you are just browsing to get a sense of what's possible, take a look at the code online.
Running the Application
Navigate to the Applications/ObjectRecognition/Python folder in the SDK Tutorials repository:
Execute the example program:
Investigation
The purpose of this tutorial is to demonstrate how to use the BOW vision channel to perform object detection on the images from a robot's cameras. It therefore demonstrates how to obtain images, pass them to an external tool, and display and annotate the images using OpenCV.
In this case the object detection is performed by the Ultralytics implementation of the YOLOv8 model. This model comes pre-trained on the COCO (Common Objects in Context) dataset and as such is capable of detecting any of the object classes in that dataset's label set. To expand on this you could try using a different detection algorithm, such as those built into OpenCV, or another external tool.
If you are planning to run this tutorial in the simulator, then we recommend you add some objects from the COCO dataset to the world for your robot to detect. Some that are readily available in Webots are:
- humans/pedestrian/Pedestrian
- objects/animals/Cat
- objects/animals/Sheep
- objects/traffic/StopSign
Similarly, if you are running this tutorial on a physical robot then make sure you have some objects to detect for testing. Potted plants, cups, bottles, mobile phones and of course yourself are all great for testing.
When you run the application an OpenCV image window will appear containing images from the first camera supplied by the robot. When an object is detected by the YOLO model, a green box will appear around it alongside a label giving the object's classification. Bring new objects into view and see how well the off-the-shelf YOLO detection works.
Code Breakdown
Imports
We start by importing all of the required Python modules, including the OpenCV module (cv2) for displaying images and labelling detected objects. In addition, we import the Ultralytics YOLO implementation to be used for visual object recognition:
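As a rough sketch, the imports might look like this; the BOW client module name (bow_api) is an assumption and should match whatever the tutorial code actually imports:

```python
import cv2                    # OpenCV (cv2) for displaying images and labelling detections
import numpy as np            # convenient array handling for the image data

from ultralytics import YOLO  # Ultralytics YOLOv8 implementation

# The BOW client import is an assumption based on the prose above;
# use whichever module the SDK Tutorials repository actually imports.
import bow_api as bow
```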
Connecting to our robot
Next we use quick_connect to connect to the robot. This function will automatically connect to the robot selected in your BOW Hub at runtime, exiting if the connection is unsuccessful:
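A minimal sketch of the connection step, assuming quick_connect returns a robot object and an error/status value (the exact signature and keyword arguments are assumptions, so check the SDK reference):

```python
# Connect to the robot currently selected in the BOW Hub.
# The (robot, error) return value and the keyword arguments are assumptions.
robot, error = bow.quick_connect(app_name="ObjectRecognition", channels=["vision"])
if not error.Success:
    print("Failed to connect to robot:", error)
    raise SystemExit(1)
```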
Then we load a YOLO model from the imported module. In this case we use 'yolov8n.pt', a model pretrained on the COCO dataset. On first run this may take a short while to download, but it will be stored locally for subsequent use:
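Loading the pretrained model with the Ultralytics API is a one-liner:

```python
# Load the 'nano' YOLOv8 model pretrained on COCO.
# The weights are downloaded on first use and cached locally afterwards.
model = YOLO("yolov8n.pt")
```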
The whole main loop is wrapped in a try/except block, which destroys the image window and exits the loop by setting the flag to True in the case of a keyboard interrupt (Ctrl-C) or if the program is closed:
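The surrounding control flow might look roughly like this skeleton (the flag name is illustrative):

```python
stop_flag = False
try:
    while not stop_flag:
        # ... main loop body, covered step by step below ...
        pass
except KeyboardInterrupt:
    # Ctrl-C or the program being closed: remove the image window and stop looping.
    cv2.destroyAllWindows()
    stop_flag = True
```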
The main loop
Inside the main loop we call the get function on the vision channel in order to obtain the list of image samples from the robot. We then test for success and that the returned list is not empty; if this test fails we begin the loop again:
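A sketch of this step; the call on the vision channel and its (samples, error) return shape are assumptions taken from the prose:

```python
# (inside the main loop)
# Request the latest image samples from the vision channel.
image_samples, err = robot.vision.get()
if not err.Success or len(image_samples) == 0:
    continue  # no images this iteration, go round the loop again
```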
We obtain the first image sample from the list and test that it contains valid data. In this simple example we always use the first camera in the list; however, you could query the list of image samples in order to choose whichever camera you wanted, for example using the "source" field for the name of the camera, or using the "Transform.Position" field to select an image based on its location on the robot:
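For example (the Data field holding the pixel data is an assumed name):

```python
# (inside the main loop)
# Use the first camera in the list; you could instead select a camera by name
# (the sample's "source" field) or by its position on the robot (Transform.Position).
img_sample = image_samples[0]
if img_sample.Data is None:   # the name of the field holding the pixel data is an assumption
    continue
```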
The image data is then extracted into the myIm variable and passed into the YOLO model, which returns a list of predictions about the objects in the image in the form of tensors:
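Running the model on the extracted frame uses the standard Ultralytics call, which returns a list of Results objects; how the raw sample converts to a numpy array depends on the SDK's image format, so treat that line as a placeholder:

```python
# (inside the main loop)
# Extract the pixel data as a numpy array that OpenCV and YOLO can work with.
myIm = np.asarray(img_sample.Data)

# Run detection; the model returns one Results object per input image,
# each holding the detected bounding boxes as tensors.
results = model(myIm)
```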
Next we check that there is at least one detected object; if so, we begin iterating through these objects in order to draw them onto the image. The two pieces of information we require in this case are the location of the detection in the frame and the predicted classification of that detection:
The xyxy.numpy() call returns the coordinates of the top-left and bottom-right corners of the object's bounding box as a numpy array, a format that we can easily work with. The classification label is obtained by taking the class value from the detection's data tensor and looking it up in the model.names list, which contains all of the class names associated with this model.
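With the Ultralytics Results API, that iteration might look like this (here the class index is read from box.cls, one common way of obtaining the same value):

```python
# (inside the main loop)
detections = results[0].boxes
if len(detections) > 0:
    for box in detections:
        # Top-left (x1, y1) and bottom-right (x2, y2) corners of the bounding box,
        # converted to plain ints for OpenCV (add .cpu() before .numpy() on a GPU).
        x1, y1, x2, y2 = (int(v) for v in box.xyxy[0].numpy())

        # Human-readable class name for this detection.
        class_name = model.names[int(box.cls[0])]
```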
The bounding box of the object can now be drawn onto the image in our chosen colour using the rectangle function from OpenCV:
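For example:

```python
        # (inside the per-detection loop)
        # Draw the bounding box in green (OpenCV uses BGR colour order), 2 pixels thick.
        colour = (0, 255, 0)
        cv2.rectangle(myIm, (x1, y1), (x2, y2), colour, 2)
```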
Next we specify the location to write the classification label on the image. By default we place it 10 pixels above the top left of the box; however, if we are within 10 pixels of the top of the image, we instead set the location to 20 pixels below the bottom left of the box. We then use this location, alongside our chosen colour, in the OpenCV putText function, which draws the text onto the image at this location.
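A sketch of that placement logic and the putText call:

```python
        # (inside the per-detection loop)
        # Place the label 10 pixels above the top-left corner of the box, unless the
        # box is within 10 pixels of the top of the image, in which case place it
        # 20 pixels below the bottom-left corner instead.
        if y1 < 10:
            label_pos = (x1, y2 + 20)
        else:
            label_pos = (x1, y1 - 10)

        cv2.putText(myIm, class_name, label_pos,
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, colour, 2)
```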
We then use the imshow function to display the resulting image in a window. The camera name obtained from the Source field of the image sample is used as the window title.
Finally, OpenCV requires a small delay following an imshow call in order to render the image. In this case we use the waitKey function to delay for 1 millisecond and capture any key press. We also use this to test for the escape key (27) and break out of the main loop if it has been pressed.
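Together, the display and key-handling steps might look like this (the Source field used for the window title is taken from the prose and is an assumed attribute name):

```python
# (back at the top level of the main loop)
# Show the annotated frame in a window titled with the camera name.
cv2.imshow(img_sample.Source, myIm)

# waitKey gives OpenCV a moment to render and returns any key pressed;
# 27 is the escape key, which we use to break out of the main loop.
if cv2.waitKey(1) == 27:
    break
```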
Exit
The last required actions are to clean everything up following the exit of the program. We ensure the image windows have been closed, disconnect from the robot, and then close our client.
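A sketch of the shutdown sequence; the disconnect and close calls on the BOW client are assumed names:

```python
# Tidy up: remove any OpenCV windows, then release the robot connection.
cv2.destroyAllWindows()
robot.disconnect()              # assumed name for disconnecting from the robot
bow.close_client_interface()    # assumed name for shutting down the client
```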