Tutorial Session in Week 04

Preparation

Before the class, you should make the following preparations:

  • Bring your laptop (with a USB port)
  • Configure the Python environment
  • Prepare your favorite IDE

You can use Ubuntu (Recommended), Windows, or macOS for today’s workshop. 

Below are the instructions for the Ubuntu system. For other systems, please search online for tutorials.

Python
Most commonly used commands
Ctrl + Alt + T: Open terminal
Ctrl + Shift + C: Copy in terminal
Ctrl + Shift + V: Paste in terminal
Python
conda create -n Me336Spring python=3.8   
conda activate Me336Spring
# If installation fails, try installing from the Tsinghua mirror (see the commented commands below)
pip install opencv-python==4.4.0.46      
# pip install opencv-python==4.4.0.46 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install opencv-contrib-python==4.4.0.46 
# pip install opencv-contrib-python==4.4.0.46 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install numpy
# pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install scikit-learn
# pip install scikit-learn -i https://pypi.tuna.tsinghua.edu.cn/simple

Class 1

A robotics engineer must learn computer vision, a crucial component in building and programming robots. Computer vision allows robots to perceive and interpret their surroundings through image processing and analysis, enabling them to make decisions, recognize objects, and respond to environmental changes. Understanding computer vision enables a robotics engineer to design and develop more advanced and autonomous robots that can perform complex tasks and interact with their environment more effectively.

Image Acquisition

  • Choose a device: common options include digital cameras, webcams, and specialized machine vision cameras.
  • Connect the device
  • Configure the device: adjust settings such as resolution, exposure time, and gain to ensure that the captured images meet your needs.
  • Acquire images: Use software or programming to acquire images from the device. This typically involves opening a connection to the device, specifying the settings, and capturing the images.
  • Store the images: Store the images on your computer or another storage device for further processing and analysis.
  • An example is shown below.
Python
import cv2

# Initialize camera
cap = cv2.VideoCapture(0)

# Check if camera opened successfully
if not cap.isOpened():
    print("Error opening video capture")
    exit()

# Set camera properties (optional)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
# cap.set(cv2.CAP_PROP_FPS, 30)
# cap.set(cv2.CAP_PROP_BRIGHTNESS, 1)
# cap.set(cv2.CAP_PROP_CONTRAST, 40)
# cap.set(cv2.CAP_PROP_SATURATION, 50)
# cap.set(cv2.CAP_PROP_HUE, 50)
# cap.set(cv2.CAP_PROP_EXPOSURE, 50)

while True:
    # Capture frame-by-frame
    ret, frame = cap.read()
    if not ret:
        print("Failed to grab frame")
        break

    # Display the resulting frame
    cv2.imshow('Camera', frame)

    # Wait for key press
    key = cv2.waitKey(1) & 0xFF

    # Check if 's' key is pressed to save image
    if key == ord('s'):
        cv2.imwrite('image.jpg', frame)
        print("Image captured!")
        break

# Release the camera and close all windows
cap.release()
cv2.destroyAllWindows()

Image Pre-processing

Image pre-processing is a crucial step in machine vision, as it can significantly impact the accuracy and performance of the final results. It typically involves several steps to clean up and enhance the images. It is important to choose pre-processing techniques suited to the specific application and to apply them judiciously so as not to distort the original information in the image.

Noise Reduction

  • Removing noise or random fluctuations in the image data to improve the quality and accuracy of the image.
  • For the example below
    • This code first loads an image using OpenCV’s imread function and applies a Gaussian blur using GaussianBlur to reduce noise. The resulting filtered image is then displayed using imshow.
    • The Gaussian blur is a popular method for noise reduction in images. It replaces each pixel value in the image with the weighted average of its neighboring pixels, effectively smoothing the image and reducing high-frequency noise. The size of the Gaussian kernel and the standard deviation can be adjusted to control the amount of smoothing applied to the image.
Python
import cv2

# Load the image
img = cv2.imread('image.jpg')

# Apply Gaussian filter
filtered = cv2.GaussianBlur(img, (5, 5), 0)

# Display the original and filtered images side by side
cv2.imshow('Original', img)
cv2.imshow('Filtered', filtered)
cv2.waitKey(0)
cv2.destroyAllWindows()

Image Correction

  • Correcting issues such as color cast or distortion in the image.
  • For the example below
    • This code first loads an image using OpenCV’s imread function, then converts the image to the YUV color space using cv2.cvtColor(), and applies histogram equalization to the luminance (Y) channel using equalizeHist to correct the color balance. The corrected image is then converted back to the original color space using cvtColor and displayed using imshow.
    • Note that equalizeHist itself takes no tuning parameters; for finer control over the correction you can use adaptive histogram equalization (CLAHE, via cv2.createCLAHE). You can also apply equalization in other color spaces, such as HSV or LAB, to correct color balance.
Python
import cv2

# Load the image
img = cv2.imread('image.jpg')

# Apply histogram equalization to each color channel
equalized = cv2.cvtColor(img, cv2.COLOR_BGR2YUV)
equalized[:, :, 0] = cv2.equalizeHist(equalized[:, :, 0])
equalized = cv2.cvtColor(equalized, cv2.COLOR_YUV2BGR)

# Display the original and equalized images side by side
cv2.imshow('Original', img)
cv2.imshow('Equalized', equalized)
cv2.waitKey(0)
cv2.destroyAllWindows()

Image Enhancement

  • Improving the visibility of features in the image, such as increasing contrast or sharpness.
  • For the example below
    • This code first loads an image using OpenCV’s imread function, converts it to grayscale using cvtColor, and applies histogram equalization using equalizeHist to enhance the contrast. The resulting enhanced image is then displayed using imshow.
    • Histogram equalization is a simple yet effective method for image enhancement. It adjusts the distribution of pixel values in an image to improve the image contrast. This can make features in the image more visible and easier to distinguish, which can be particularly useful for images with low contrast or poor lighting conditions.
Python
import cv2
import numpy as np

# Load the image
img = cv2.imread('image.jpg')

# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply a histogram equalization to enhance the image contrast
enhanced = cv2.equalizeHist(gray)

# Show the original and enhanced images
cv2.imshow('Original Image', gray)
cv2.imshow('Enhanced Image', enhanced)
cv2.waitKey(0)
cv2.destroyAllWindows()

Image Segmentation

  • Partitioning an image into multiple regions, each corresponding to a different object or the background.
  • For the example below
    • This code first loads an image using OpenCV’s imread function, converts it to grayscale using cvtColor, and applies thresholding using threshold to segment the image into foreground and background regions. The resulting thresholded image is then displayed using imshow.
    • Thresholding is a simple yet effective method for image segmentation. It converts an image into a binary image by setting all pixels above a specified threshold to the maximum value (255) and all pixels below the threshold to the minimum value (0 in this case). This effectively separates the image into foreground and background regions, making it easier to identify objects or features in the image. The threshold value can be adjusted to control the level of detail in the segmented image.
Python
import cv2
import numpy as np

# Load the image
img = cv2.imread('image.jpg')

# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply thresholding to segment the image
_, thresholded = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)

# Show the original and thresholded images
cv2.imshow('Original Image', img)
cv2.imshow('Thresholded Image', thresholded)
cv2.waitKey(0)
cv2.destroyAllWindows()

Resizing and Cropping

  • Changing the size or aspect ratio of the image or removing unwanted regions from the image.
  • For the example below
    • This code first loads an image using OpenCV’s imread function, resizes it using resize to a specific size (600×400 pixels in this case), and crops it using indexing to a specific region (100 to 300 pixels in the vertical direction and 200 to 400 pixels in the horizontal direction in this case). The original, resized, and cropped images are then displayed using imshow.
    • Resizing and cropping are common pre-processing steps for images in computer vision and image processing. Resizing can be used to change the size of an image, which can be important for image processing algorithms that require a specific size or aspect ratio. Cropping can extract a particular region of interest from an image, which can help remove unwanted parts of an image or focus on specific features.
Python
import cv2
import numpy as np

# Load the image
img = cv2.imread('image.jpg')

# Resize the image
resized = cv2.resize(img, (600, 400), interpolation=cv2.INTER_AREA)

# Crop the image
cropped = resized[100:300, 200:400]

# Show the original, resized, and cropped images
cv2.imshow('Original Image', img)
cv2.imshow('Resized Image', resized)
cv2.imshow('Cropped Image', cropped)
cv2.waitKey(0)
cv2.destroyAllWindows()

Feature Detection and Extraction

  • Feature detection and extraction is an essential step in computer vision that involves finding and describing the key points or features in an image that are relevant to a specific task, such as object recognition, image matching, or tracking. There are many different feature detection and extraction algorithms, each with strengths and weaknesses, depending on the specific task and the type of image.
  • Shown below is an example of feature detection and extraction in Python using the OpenCV library and the SIFT (Scale-Invariant Feature Transform) algorithm.
    • This code first loads an image using OpenCV’s imread function, converts it to grayscale using cvtColor, and creates a SIFT object using cv2.SIFT_create. The detectAndCompute function is then used to detect key points in the image and compute their descriptors. The resulting key points are drawn on the image using drawKeypoints, and the original and keypoint-annotated images are displayed using imshow.
    • SIFT is a popular feature detection and extraction algorithm that is robust to image scale, orientation, and affine distortion. It works by detecting distinctive points in an image and computing a descriptor that describes the local appearance of the image at that point. The descriptors can then be used for tasks such as image matching, object recognition, or tracking. The key points and descriptors can also be visualized on the image to help understand and debug.
Python
import cv2

# Load the input image
input_image = cv2.imread('image.jpg')

# Convert the input image to grayscale
gray_image = cv2.cvtColor(input_image, cv2.COLOR_BGR2GRAY)

# Create a SIFT object for feature detection and extraction
sift = cv2.SIFT_create()

# Detect keypoints and compute descriptors for the input image
keypoints, descriptors = sift.detectAndCompute(gray_image, None)

# Draw the detected keypoints on the input image
output_image = cv2.drawKeypoints(input_image, keypoints, None)

# Show the output image
cv2.imshow('Output Image', output_image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Image Analysis and Interpretation

  • Image analysis and interpretation extract meaningful information and insights from images. It involves a combination of computer vision techniques, such as feature detection and extraction, image segmentation, and machine learning, to understand the content of an image and make decisions based on that information.
  • For example, an image analysis and interpretation system could detect and classify objects in an image, recognize handwritten text, analyze medical images for signs of disease, or interpret satellite images for environmental monitoring.
  • Here is an example of image analysis and interpretation for shape detection:
    • This code creates a window called ‘Canvas’, and sets up a mouse callback function that draws on the canvas when the left mouse button is pressed and dragged. When the mouse button is released, the code converts the canvas to grayscale, thresholds it, finds the contours, and identifies the shape of the outline by checking the number of corners using cv2.approxPolyDP(). Finally, the code displays the canvas on the screen and prints the final shape when the user presses the ‘q’ key.
Python
import cv2
import numpy as np

# Set up the canvas
canvas = np.zeros((480, 640, 3), dtype=np.uint8)
drawing = False
ix, iy = -1, -1
shape = ''

# Define the mouse callback function
def draw_shape(event, x, y, flags, param):
    global canvas, drawing, ix, iy, shape

    if event == cv2.EVENT_LBUTTONDOWN:
        drawing = True
        ix, iy = x, y

    elif event == cv2.EVENT_MOUSEMOVE:
        if drawing == True:
            cv2.line(canvas, (ix, iy), (x, y), (0, 0, 255), 2)
            ix, iy = x, y

    elif event == cv2.EVENT_LBUTTONUP:
        drawing = False
        cv2.line(canvas, (ix, iy), (x, y), (0, 0, 255), 2)

        # Convert the canvas to grayscale and threshold it
        gray = cv2.cvtColor(canvas, cv2.COLOR_BGR2GRAY)
        _, thresh = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)

        # Find contours in the thresholded image
        contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        # Identify the shape of the outline
        for cnt in contours:
            area = cv2.contourArea(cnt)
            if area < 500:
                # Ignore small contours
                continue
            peri = cv2.arcLength(cnt, True)
            approx = cv2.approxPolyDP(cnt, 0.04 * peri, True)
            if len(approx) == 3:
                shape = 'Triangle'
            elif len(approx) == 4:
                shape = 'Rectangle'
            else:
                shape = 'Circle'

# Set up the window and mouse callback
cv2.namedWindow('Canvas')
cv2.setMouseCallback('Canvas', draw_shape)

# Main loop
while True:
    cv2.imshow('Canvas', canvas)
    key = cv2.waitKey(1) & 0xFF

    if key == ord('q'):
        break
    elif key == ord('c'):
        canvas = np.zeros((480, 640, 3), dtype=np.uint8)
        shape = ''

# Print the final shape
print('Final Shape: {}'.format(shape))

cv2.destroyAllWindows()

Feedback and Control

  • Machine vision systems often provide feedback to control other systems, such as robots, so they can perform tasks based on the visual information they receive; a minimal closed-loop sketch is shown below.
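
As a rough illustration of visual feedback (not tied to any particular robot), the sketch below computes the offset of a detected face from the image center and turns it into a proportional velocity command. It reuses the Haar cascade face detector from the next section; send_velocity_command and the gain value are hypothetical placeholders for whatever interface and tuning your robot actually provides.

Python
import cv2

# Hypothetical robot interface; replace with your robot's actual API
def send_velocity_command(vx, vy):
    print(f"velocity command: vx={vx:.3f}, vy={vy:.3f}")

face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
cap = cv2.VideoCapture(0)
gain = 0.002  # proportional gain (made-up value)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    if len(faces) > 0:
        # Pixel offset of the first detected face from the image center
        x, y, w, h = faces[0]
        err_x = (x + w / 2) - frame.shape[1] / 2
        err_y = (y + h / 2) - frame.shape[0] / 2
        # Proportional control: the command scales with the pixel error
        send_velocity_command(gain * err_x, gain * err_y)

    cv2.imshow('Feedback', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()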

Operational Efficiency

  • Software operational efficiency is critical in computer vision as it can impact the performance and speed of computer vision systems and applications. An efficient computer vision system will quickly process images and video frames, which is critical for real-time applications. Inefficient systems can lead to slow processing times, high memory usage, and reduced accuracy, impacting overall performance and user experience.
  • Take two methods for tracking human faces in real-time as an example.
    • Sample Method 1: This code uses the haarcascade_frontalface_default.xml classifier from OpenCV’s pre-trained Haar cascades to detect faces in the video frames. The rectangles around the detected faces are drawn in real-time and displayed in a window named Frame.
    • Sample Method 2: In this example, the face_cascade is a trained Haar cascade classifier for face detection, which is loaded from a file. The cap object captures video frames from the default camera. The face region is first detected with the Haar cascade and then tracked with a MOSSE tracker, which is updated in each iteration of the loop; detection is only re-run when tracking fails.
  • Model File (Click to Download)
Sample Method 1
import cv2
import time

# Load the pre-trained classifier
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

# Initialize the video capture device
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
# Initialize variables for calculating average time
total_time = 0
frame_count = 0

while True:
    # Capture frame-by-frame and record start time
    ret, frame = cap.read()
    if not ret:
        break
    start_time = time.time()

    # Convert the frame to grayscale
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Detect faces in the grayscale frame
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    # Draw a rectangle around each detected face
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    # Display the resulting frame
    cv2.imshow('frame', frame)

    # Calculate time spent on this frame and update variables
    end_time = time.time()
    time_spent = end_time - start_time
    total_time += time_spent
    frame_count += 1

    # Check for key press and exit if necessary
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release the capture device and close all windows
cap.release()
cv2.destroyAllWindows()

# Calculate and print average time spent per frame
avg_time_per_frame = total_time / frame_count
print("Average time spent per frame: {:.2f} seconds".format(avg_time_per_frame))
Sample Method 2
import cv2
import time

# Load the pre-trained Haar Cascade classifier
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

# Create the MOSSE tracker (requires opencv-contrib-python; in OpenCV >= 4.5.1 it moved to cv2.legacy.TrackerMOSSE_create())
tracker = cv2.TrackerMOSSE_create()

# Load the video
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
# Initialize variables for time measurement
frame_count = 0
total_time = 0

# Read the first frame
ret, frame = cap.read()

# Detect the initial face(s) using the Haar Cascade classifier
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)

# Initialize the tracker with the first face (if any)
if len(faces) > 0:
    # Select the first face
    (x, y, w, h) = faces[0]
    # Initialize the tracker with the first face
    tracker.init(frame, (x, y, w, h))

# Loop over each frame in the video
while True:
    # Read a frame from the video
    ret, frame = cap.read()
    # If we reached the end of the video, break out of the loop
    if not ret:
        break

    # Start the timer
    start_time = time.time()

    # Track the face using the MOSSE tracker (if initialized)
    if tracker:
        # Update the tracker with the current frame
        ok, bbox = tracker.update(frame)
        # If the tracking was successful, draw a rectangle around the face
        if ok:
            (x, y, w, h) = [int(v) for v in bbox]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        # If the tracking failed (e.g., the face went out of the frame), reset the tracker
        else:
            tracker = None

    # If the tracker is not initialized or failed, detect faces using the Haar Cascade classifier
    if not tracker:
        # Convert the frame to grayscale
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Detect faces in the grayscale image using the Haar Cascade classifier
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
        # If a face is detected, initialize the tracker with the first face
        if len(faces) > 0:
            (x, y, w, h) = faces[0]
            tracker = cv2.TrackerMOSSE_create()
            tracker.init(frame, (x, y, w, h))

    # Show the frame with the detected/tracked face
    cv2.imshow('frame', frame)

    # Stop the timer and update the time variables
    end_time = time.time()
    total_time += end_time - start_time
    frame_count += 1

    # Wait for a key press, and exit the loop if the 'q' key is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release the resources (camera and window)
cap.release()
cv2.destroyAllWindows()

# Calculate the average time cost per frame
average_time = total_time / frame_count
print('Average time cost per frame: {:.2f} seconds'.format(average_time))

Class 2 on Mar 15th

Hardware Toolkit

Part 1: Pose Estimation

  • Pose estimation is the process of determining the position and orientation of an object in a given environment. It is commonly used in computer vision and robotics applications, such as human-computer interaction, augmented reality, and autonomous navigation. Pose estimation algorithms use a combination of computer vision techniques, such as feature detection, image segmentation, and machine learning, to estimate the position and orientation of an object in 3D space based on its image or a sequence of images. The accuracy of pose estimation results can be affected by various factors, such as the quality of the images, the complexity of the object, and the presence of occlusions.

Task 1: Camera Calibration

Camera calibration is essential before pose estimation because it helps correct the camera’s intrinsic and extrinsic parameters. These parameters include:

  1. Intrinsic Parameters: Characterize the camera’s internal optics and geometry, such as the focal length, principal point, and lens distortion.
  2. Extrinsic Parameters: Characterize the position and orientation of the camera with respect to the world coordinate system, i.e., the rotation and translation that relate the world frame to the camera frame.
Intrinsic Parameters

Focal Length, fx, fy:

The focal length is the distance between the pinhole and the image plane (the “film” in the pinhole model); fx and fy express it in pixel units along the x and y axes.

Axis Skew, s

Skew coefficient, which is non-zero if the image axes are not perpendicular.

Principal Point Offset x0, y0:

The camera’s “principal axis” is the line perpendicular to the image plane that passes through the pinhole. Its intersection with the image plane is referred to as the “principal point”; the principal point offset (x0, y0) gives its position in image coordinates.
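
To make these parameters concrete, here is a minimal NumPy sketch (with made-up values) that assembles the intrinsic matrix from fx, fy, s, x0, and y0 and projects a 3D point onto the image plane; the extrinsic rotation and translation used here are likewise illustrative placeholders.

Python
import numpy as np

# Made-up intrinsic parameters, in pixel units
fx, fy = 800.0, 800.0    # focal lengths
s = 0.0                  # axis skew (approximately zero for most cameras)
x0, y0 = 640.0, 360.0    # principal point offset

# Intrinsic matrix
K = np.array([[fx, s,  x0],
              [0,  fy, y0],
              [0,  0,  1.0]])

# Extrinsic parameters: rotation R and translation t map a world point
# into the camera frame: X_cam = R @ X_world + t
R = np.eye(3)                   # camera aligned with the world axes
t = np.array([0.0, 0.0, 1.0])   # world origin lies 1 m in front of the camera

# Project a world point onto the image plane
X_world = np.array([0.1, -0.05, 0.0])
X_cam = R @ X_world + t
u, v, w = K @ X_cam
print("Pixel coordinates:", u / w, v / w)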

Without camera calibration, the estimated pose of the object would be incorrect due to the lens distortion, aspect ratio, and perspective effects introduced by the camera. Additionally, camera calibration helps to remove the effects of non-uniform illumination and other environmental factors that can affect the quality of the image data.

In the figure below, the lens parameters estimated using geometric calibration were used to undistort the image.

Effect of geometric correction on distorted image.

Camera calibration can be performed using a calibration pattern, such as a checkerboard, and can be done offline or online, depending on the application’s requirements. Once the camera parameters are known, they can be used to undistort and rectify the image so that straight lines in the scene appear straight. These steps help improve pose estimation accuracy and ensure that the estimated pose is reliable and repeatable. Click here to download the calibration pattern (table mat).
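
As a rough sketch of how the saved parameters can be applied, the snippet below undistorts a single image with cv2.undistort. It assumes the camera_params.npz file produced by the calibration code further down; 'distorted.jpg' is a placeholder for any image taken with the same camera.

Python
import cv2
import numpy as np

# Load the intrinsic matrix and distortion coefficients saved by the calibration script
params = np.load('camera_params.npz')
mtx, dist = params['mtx'], params['dist']

# Placeholder file name: use any image captured by the calibrated camera
img = cv2.imread('distorted.jpg')
h, w = img.shape[:2]

# Refine the camera matrix for this image size, then undistort
new_mtx, roi = cv2.getOptimalNewCameraMatrix(mtx, dist, (w, h), 1, (w, h))
undistorted = cv2.undistort(img, mtx, dist, None, new_mtx)

cv2.imshow('Original', img)
cv2.imshow('Undistorted', undistorted)
cv2.waitKey(0)
cv2.destroyAllWindows()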

You can learn more about camera calibration from the OpenCV documentation here.

Using the following code, you can complete the task of photographing and camera calibration. First, enter the camera index; in most cases it should be 0, but if the USB camera cannot be opened, try 1 or -1. Then enter the number of photos to take (around 20 is recommended). Finally, you will obtain a file called ‘camera_params.npz’ that stores the camera parameters.

An Example
import cv2
import numpy as np
import os
import time
import shutil
import sys

# Set the chessboard size and grid width
num_horizontal = 11
num_vertical = 8
grid_width = 0.007    # meter

def capture_calibration_images(camera_id, num_images, delay=3):
    folder = 'cali_imgs'
    if os.path.exists(folder):
        # Delete all files in the folder
        for filename in os.listdir(folder):
            file_path = os.path.join(folder, filename)
            try:
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.unlink(file_path)
                elif os.path.isdir(file_path):
                    shutil.rmtree(file_path)
            except Exception as e:
                print(f'Failed to delete {file_path}. Reason: {e}')
    else:
        # If the folder does not exist, create it
        os.makedirs(folder)

    cap = cv2.VideoCapture(camera_id)
    if not cap.isOpened():
        print("Failed to open the camera. Reason: {cap}")
        return
    
    photo_count = 0
    start_time = time.time()
    
    while cap.isOpened() and photo_count < num_images:
        ret, frame = cap.read()
        if not ret:
            break
        font = cv2.FONT_HERSHEY_SIMPLEX
        # Calculate the remaining time for the countdown (in seconds)
        current_time = time.time()
        elapsed_time = current_time - start_time
        remaining_time = delay - elapsed_time if elapsed_time < delay else 0

        # Calculate the end angle of the sector
        end_angle = 360 * remaining_time / delay

        # Create a copy of the original image to draw the countdown overlay on
        overlay = frame.copy()

        # Draw a white ellipse on the copy
        cv2.ellipse(overlay, (int(frame.shape[1]*0.5), int(frame.shape[0]*0.5)), (int(frame.shape[0]*0.5), int(frame.shape[0]*0.5)), -90, 0, end_angle, (255, 255, 255), -1)

        # Blend the copy and the original image with a transparency of 50%
        alpha = 0.5
        cv2.addWeighted(overlay, alpha, frame, 1 - alpha, 0, frame)
        cv2.putText(frame, f'Photo {photo_count + 1}/{num_images}', (frame.shape[1] - 200, frame.shape[0] - 10), font, 0.5, (0, 255, 0), 2, cv2.LINE_AA)
        cv2.imshow('Calibration', frame)

        # Break the loop if 'q' is pressed
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

        # Save the image every delay seconds
        if time.time() - start_time >= delay:
            filename = os.path.join(folder, f'image_{photo_count}.jpg')
            cv2.imwrite(filename, frame)
            print(f"Image saved:{filename}")
            photo_count += 1
            start_time = time.time()

            # After saving the image, display a white image
            white_frame = np.ones_like(frame) * 255
            cv2.imshow('Calibration', white_frame)
            cv2.waitKey(100)  
            cv2.imshow('Calibration', frame)

    cap.release()
    cv2.destroyAllWindows()

def calibrate_camera(folder):
    # termination criteria
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
    # prepare object points, like (0,0,0), (1,0,0), (2,0,0) ....,(10,7,0)
    objp = np.zeros((num_vertical*num_horizontal,3), np.float32)
    objp[:,:2] = np.mgrid[0:num_horizontal,0:num_vertical].T.reshape(-1,2) * grid_width
    # Arrays to store object points and image points from all the images.
    objpoints = [] # 3d point in real world space
    imgpoints = [] # 2d points in image plane.
    # Read images
    images = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith('.jpg')]
    images = sorted(images, key=lambda x: int(os.path.splitext(os.path.basename(x))[0].split('_')[1]))

    for fname in images:
        img = cv2.imread(fname)
        print("Image: ", fname)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # Find the chess board corners
        ret, corners = cv2.findChessboardCorners(gray, (num_horizontal, num_vertical), None)
        
        # If found, add object points, image points (after refining them)
        if ret == True:
            print('corners found:', corners.shape[0])
            objpoints.append(objp)
            corners2 = cv2.cornerSubPix(gray, corners, (11,11), (-1,-1), criteria)
            imgpoints.append(corners2)
            # Draw and display the corners
            cv2.drawChessboardCorners(img, (num_horizontal, num_vertical), corners2, ret)
            cv2.imshow('img', img)
            cv2.waitKey(500)

    if not objpoints or not imgpoints:
        print("Not enough corners were detected for calibration")
        time.sleep(2)
        sys.exit(1)
    
    cv2.destroyAllWindows()

    # Calibrate the camera
    ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(objpoints, imgpoints, gray.shape[::-1], None, None)

    # Calculate reprojection error
    mean_error = 0
    for i in range(len(objpoints)):
        imgpoints2, _ = cv2.projectPoints(objpoints[i], rvecs[i], tvecs[i], mtx, dist)
        error = cv2.norm(imgpoints[i],imgpoints2, cv2.NORM_L2)/len(imgpoints2)
        mean_error += error
    mean_error /= len(objpoints)
    print(f"Reprojection error: {mean_error}")

    # Check if calibration is qualified
    if mean_error < 1:
        np.savez("camera_params.npz", mtx=mtx, dist=dist, rvecs=rvecs, tvecs=tvecs)
        print("Camera calibrated")
    else:
        print("Calibration failed, please try again")

    return ret, mtx, dist, rvecs, tvecs

# Ask the user to input the camera ID
camera_id = int(input("Please enter your camera ID: "))
num_images = int(input("Please enter the number of photos you want to take: "))

capture_calibration_images(camera_id, num_images)

ret, mtx, dist, rvecs, tvecs = calibrate_camera('cali_imgs')
if ret:
    print("Calibration successful! Camera parameters are saved in 'camera_params.npz' file.")
    np.set_printoptions(precision=4, suppress=True)
    print('Camera Intrinsics Matrix:\n[[fx\t0\tcx]\n[0\tfy\tcy]\n[0\t0\t1]]  =\n', np.round(mtx, 4))
    print('dist:', np.round(dist, 4))

Task 2: Visual Object Tracking

  • In this part of the experiment, you will use the camera parameters obtained in the steps above to estimate the 6D pose of an ArUco marker and collect the data for further use.
  • An ArUco marker is a type of fiducial marker commonly used in computer vision applications for object tracking and localization. It is a square marker with a black-and-white pattern designed to be easily detectable by a camera; the pattern encodes a binary matrix that gives each marker a unique identifier. ArUco markers come in different sizes and can be printed on paper or attached to an object. Learn more about ArUco markers.
  • The ArUco marker detection algorithm works by detecting the marker’s corners in the image using computer vision techniques such as edge detection and corner detection. Once the corners are detected, the algorithm can estimate the marker’s pose (position and orientation) in 3D space relative to the camera using a process called perspective-n-point (PnP) estimation; a minimal illustration of this step is sketched after the example markers below. Here is the website where you can generate ArUco markers online.
Example of ArUco markers
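
The estimatePoseSingleMarkers call used in the tracking code below wraps exactly this PnP step. As a rough illustration, the sketch below calls cv2.solvePnP directly with the four 3D corner coordinates of a marker of known side length; the 2D corner coordinates, camera matrix, and distortion values are made-up placeholders (in a real run they come from detectMarkers and from camera_params.npz).

Python
import cv2
import numpy as np

# Marker side length, in the unit you want the translation expressed in
marker_len = 0.05  # 5 cm

# 3D corner coordinates of the marker in its own frame,
# ordered like ArUco detections: top-left, top-right, bottom-right, bottom-left
obj_points = np.array([
    [-marker_len / 2,  marker_len / 2, 0],
    [ marker_len / 2,  marker_len / 2, 0],
    [ marker_len / 2, -marker_len / 2, 0],
    [-marker_len / 2, -marker_len / 2, 0],
], dtype=np.float32)

# Placeholder 2D corner detections (pixels)
img_points = np.array([
    [600, 300],
    [700, 305],
    [695, 405],
    [595, 400],
], dtype=np.float32)

# Placeholder intrinsics; in Task 2 these are loaded from camera_params.npz
mtx = np.array([[800.0, 0.0, 640.0],
                [0.0, 800.0, 360.0],
                [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# Solve the perspective-n-point problem: marker pose relative to the camera
ok, rvec, tvec = cv2.solvePnP(obj_points, img_points, mtx, dist)
print("Rotation vector:\n", rvec)
print("Translation vector:\n", tvec)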

NOTICE:

  1. You should finish Task 1 first, because the ‘camera_params.npz’ file is needed in Task 2.
  2. Please check your OpenCV version (the snippet below prints it). If you installed version 4.4.0 following the instructions we gave you last time, run the first tracking example below. If your version is newer (>= 4.7.0), run the second one.
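
A quick way to see which API variant you need is to print the installed version; this check is only a convenience and is not part of the tracking code.

Python
import cv2

# Prints something like '4.4.0' or '4.7.0'
print(cv2.__version__)
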
ArUco marker tracking (OpenCV version < 4.7.0)
import cv2
import numpy as np
import time
import cv2.aruco as aruco

# Define ArUco dictionary and parameters
arucoDict = aruco.getPredefinedDictionary(aruco.DICT_4X4_50)
arucoParams = aruco.DetectorParameters_create()

# Initialize camera capture
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

# Define an empty list to store the time series data
marker_poses = []

# Get the start time
start_time = time.time()

# Load the result from camera calibration
data = np.load('camera_params.npz')
mtx = data['mtx']
dist = data['dist']

while cap.isOpened():
    # Capture frame-by-frame
    ret, frame = cap.read()
    if not ret:
        break

    # Get the current time
    current_time = time.time()

    # Detect ArUco markers in the frame
    corners, ids, rejected = cv2.aruco.detectMarkers(frame, arucoDict, parameters=arucoParams)

    if ids is not None:
        # Loop over all detected markers
        for i in range(len(ids)):
            # Estimate the pose of the marker
            rvec, tvec, _ = cv2.aruco.estimatePoseSingleMarkers(corners[i], 20, mtx, dist)

            # Draw axis on the marker
            cv2.aruco.drawAxis(frame, mtx, dist, rvec, tvec, 10)

            # Get the time difference from the start of recognition
            time_diff = current_time - start_time

            # Add the ID and pose data to the time series
            marker_poses.append({'id': ids[i][0], 'rvec': rvec, 'tvec': tvec, 'time': time_diff})
            
            # Compute homogenous transformation matrix
            rmat = cv2.Rodrigues(rvec)[0]
            homogenous_trans_mtx = np.append(rmat, [[tvec[0][0][0]], [tvec[0][0][1]], [tvec[0][0][2]]], axis=1)
            homogenous_trans_mtx = np.append(homogenous_trans_mtx, [[0, 0, 0, 1]], axis=0)
            print('id: ', ids[i], 'time:', round(current_time-start_time, 3))
            print("homogenous_trans_matrix\n", np.array2string(homogenous_trans_mtx, precision=3, suppress_small=True))

    # Display the resulting frame
    cv2.imshow('frame', frame)

    # Exit on key press
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Save the time series data to file
filename = 'marker_poses.npz'
np.savez(filename, marker_poses=marker_poses)

# Release capture and destroy window
cap.release()
cv2.destroyAllWindows()
ArUco marker tracking (OpenCV version >= 4.7.0)
import cv2
import numpy as np
import time
import cv2.aruco as aruco

# Define aruco dictionary
arucoDict = aruco.getPredefinedDictionary(aruco.DICT_4X4_50)
arucoParams = aruco.DetectorParameters()

# Initialize camera capture
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

# Define an empty list to store the time series data
marker_poses = []

# Get the start time
start_time = time.time()

# Get the directory where the executable file is located
calibration_file = np.load('camera_params.npz')
intrinsic_camera = calibration_file['mtx']
distortion = calibration_file['dist']

while cap.isOpened():
    ret, frame = cap.read()
    if not ret: break

    # Get the current time
    current_time = time.time()

    # Detect aruco markers
    corners, ids, rejected = cv2.aruco.detectMarkers(frame, arucoDict, parameters=arucoParams)
    if ids is not None:
        for index in range(0, len(ids)):
            # Estimate the pose of the marker
            rvec, tvec, _ = cv2.aruco.estimatePoseSingleMarkers(corners[index], 20, intrinsic_camera,
                                                                    distortion)
            cv2.aruco.drawDetectedMarkers(frame, corners, ids)
            cv2.drawFrameAxes(frame, intrinsic_camera, distortion, rvec, tvec, 10)

            # Get the time difference from the start of recognition
            time_diff = current_time - start_time

            # Add the ID and pose data to the time series
            marker_poses.append({'id': ids[index][0], 'rvec': rvec, 'tvec': tvec, 'time': time_diff})
            
            # Compute homogenous transformation matrix
            rmat = cv2.Rodrigues(rvec)[0]
            homogenous_trans_mtx = np.append(rmat, [[tvec[0][0][0]], [tvec[0][0][1]], [tvec[0][0][2]]], axis=1)
            homogenous_trans_mtx = np.append(homogenous_trans_mtx, [[0, 0, 0, 1]], axis=0)
            print('id: ', ids[index], 'time:', round(current_time-start_time, 3))
            print("homogenous_trans_matrix\n", np.array2string(homogenous_trans_mtx, precision=3, suppress_small=True))
                
    cv2.imshow('frame', frame)

    # Exit the loop when q is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Save the time series data to file
filename = 'marker_poses.npz'
np.savez(filename, marker_poses=marker_poses)
print('Saved')

cap.release()
cv2.destroyAllWindows()

Here are the results of the program execution: the 6D pose data of each ArUco marker is printed as a homogeneous transformation matrix. This is a 4×4 matrix in which the top-left 3×3 block is the rotation matrix and the fourth column is the translation vector.

Part 2: Machine Learning Algorithms

  • Neural networks play a significant role in computer vision as they enable the development of advanced image analysis and interpretation algorithms. Artificial neural networks are inspired by biological neural networks and are designed to loosely mimic the way the human brain processes information.
  • Neural networks have proven effective in solving complex computer vision tasks such as object recognition, image classification, semantic segmentation, and object detection. They can learn patterns and features in images and videos and use this knowledge to predict new data.
  • One of the main advantages of neural networks is their ability to handle high-dimensional data, such as images, and extract relevant features. They can also handle large amounts of data, making them well-suited for computer vision applications that require large amounts of training data.
  • In conclusion, using neural networks in computer vision has led to significant advancements. It has opened up new opportunities for developing more advanced and accurate computer vision systems.

Task 3: Linear Regression

  • Linear regression is a statistical method to model the relationship between a dependent variable and one or more independent variables. It’s a simple approach to modeling, where the relationship between the variables is represented as a linear equation.
  • In linear regression, the goal is to find the line of best fit that minimizes the difference between the observed values and the values predicted by the model. The line of best fit is represented by the equation of a line, which has the form: y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn, where y is the dependent variable, x1, x2, ..., xn are the independent variables, b0 is the y-intercept, and b1, b2, …, bn are the coefficients describing the relationship between each independent and dependent variable.
  • Linear regression is widely used in many fields, including finance, economics, and engineering. In computer vision, linear regression can be used for tasks such as image classification or object recognition, where the goal is to predict the class or label of an object based on its features.
  • This code uses the LinearRegression class from the scikit-learn library to perform linear regression. First, we generate random training data with 100 samples, each with six input features and six output values. Then, we create a LinearRegression object and fit it to the training data using the fit method. Finally, we use the trained model to predict the outputs for a new sample with six features.
Python
from sklearn.linear_model import LinearRegression
import numpy as np

# generate some random data with 6 input features and 6 output variables
X = np.random.rand(100, 6)
y = np.random.rand(100, 6)

# create a Linear Regression model and fit it to the data
model = LinearRegression()
model.fit(X, y)

# predict outputs for new data using the trained model
X_new = np.array([[1, 2, 3, 4, 5, 6]])
y_pred = model.predict(X_new)

print("Predicted outputs for new data:")
print(y_pred)

Task 4: Support Vector Machines

  • Support Vector Machine (SVM) is a type of supervised machine learning algorithm used for classification, regression and outlier detection. SVM works by mapping the input data into a higher dimensional space and finding the hyperplane that best separates the data into classes, or predicts the target value in regression, or identifies outliers. The SVM algorithm tries to maximize the margin between the two classes, which is defined as the distance between the hyperplane and the closest data points, known as support vectors. The support vectors are the most important data points in determining the position and orientation of the hyperplane. In the case of non-linearly separable data, SVM can be used with kernel methods to project the data into a higher dimensional space where a linear hyperplane can be used for separation.
  • This code generates random input and output data, trains an SVM classifier using a radial basis function (RBF) kernel, and then makes predictions on new input data.
Python
import numpy as np
from sklearn import svm

# Generate example input and output data
X = np.random.rand(100, 6)
y = np.random.randint(0, 6, 100)

# Create an SVM classifier using a radial basis function (RBF) kernel
clf = svm.SVC(kernel='rbf')

# Train the model on the input data
clf.fit(X, y)

# Predict the output values for new input data
new_input = np.array([[1, 2, 3, 4, 5, 6]])
prediction = clf.predict(new_input)

print("Prediction:", prediction)

Task 5: Multilayer Perceptron

  • A Multilayer Perceptron (MLP) is an artificial neural network commonly used for supervised learning. It consists of multiple layers of interconnected nodes (also called artificial neurons), each fully connected to the next. The MLP is called a “feedforward” neural network because the information flows in one direction, from input to output, with no loops or feedback connections. The input layer of the MLP receives data, which is then processed through the hidden layers, and the output layer produces the final output. Each node in the hidden and output layers applies a weighted sum of inputs and passes it through an activation function, introducing non-linearity into the model. The weights and biases of the MLP are learned through a process called backpropagation, which involves adjusting the weights to minimize the difference between the predicted output and the actual output. MLPs can be used for various tasks, including classification and regression. They have been successfully applied in many fields, such as computer vision, speech recognition, and natural language processing. However, they may suffer from overfitting and are sensitive to the choice of hyperparameters, such as the number of hidden layers and nodes in each layer.
Python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate random X and y datasets
np.random.seed(42)
X = np.random.rand(100, 6)
y = np.random.rand(100, 6)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the MLPRegressor model with 2 hidden layers, each with 50 neurons
mlp = MLPRegressor(hidden_layer_sizes=(50,50), max_iter=1000, random_state=42)

# Train the model on the training set
mlp.fit(X_train, y_train)

# Use the trained model to make predictions on the testing set
y_pred = mlp.predict(X_test)

# Calculate the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Report Requirement for Tutorial Session in Week 04

For Part 1

  • As a team, you need to record the experiment’s results in this part of the report. Since the code has been provided, you only need to modify the code parameters, take a screenshot, and fill in the blanks in the report template.

For Part 2

  • As a team, your assignment is to use the raw dataset provided here to build a simple model that estimates the forces and torques on the soft finger with the highest possible accuracy.
    • The experiment setup for collecting the raw dataset is as follows.
    • The data collection procedure is as follows.

Click here to download the raw dataset, which includes the files listed below. The input to your model should be the raw dataset, and your model’s outputs should be the three estimated forces and three estimated torques. The code below may help you process the dataset.

Python
import numpy as np
import cv2

# Define the filename of the npz file
filename = 'marker_poses_tactile.npz'

# Load the npz file with allow_pickle=True
data = np.load(filename, allow_pickle=True)

# Extract the marker poses time series data
marker_poses = data['marker_poses']

# Init pose
relativePoses = []
rvecsForMarker4 = []
rvecsForMarker5 = []
tvecsForMarker4 = []
tvecsForMarker5 = []
force = []

#########Code below is for you to see the structure for the data
# # Print the ID and pose data for each marker at each time point
# for i, marker_data in enumerate(marker_poses):
#     print(f'Time {marker_data["time"]} seconds:')
#     # Access the single integer value of the marker ID for this time point
#     marker_id = marker_data['id']
#     print(f'Marker {marker_id}:')
#     print(f'    rvec: {marker_data["rvec"][0][0]}')
#     print(f'    tvec: {marker_data["tvec"][0][0]}')
#     print(f'    forceSense: {marker_data["tactile"]}')
#########

for i, marker_data in enumerate(marker_poses):
    # calculate relative pose between markers
    marker_id = marker_data['id']
    if marker_id == 4:
        rvecsForMarker4 = marker_data["rvec"][0][0]
        tvecsForMarker4 = marker_data["tvec"][0][0]
    elif marker_id == 5:
        rvecsForMarker5 = marker_data["rvec"][0][0]
        tvecsForMarker5 = marker_data["tvec"][0][0]

        R1, _ = cv2.Rodrigues(rvecsForMarker4)
        R2, _ = cv2.Rodrigues(rvecsForMarker5)
        t1 = tvecsForMarker4.reshape(-1)
        t2 = tvecsForMarker5.reshape(-1)
        R_rel = np.dot(R2.T, R1)
        t_rel = np.dot(-R2.T, t1) + np.dot(R2.T, t2)

        # convert relative rotation matrix to rotation vector
        rvec_rel, _ = cv2.Rodrigues(R_rel)
        rvec_rel = np.array([rvec_rel[0][0],rvec_rel[1][0],rvec_rel[2][0]])

        # format relative pose as 6-dimensional array
        relativePose = np.concatenate((rvec_rel, t_rel)).reshape(1, 6)[0]

        relativePoses.append(relativePose)
        force.append(marker_data["tactile"])

# Finish your training and testing process here

  • The evaluation metric is mainly prediction accuracy, which should be as high as reasonably possible at the lowest cost.
  • You can use any model or programming language of your choice. Still, you must try more than three different settings for models and parameters and give a specific performance analysis (accuracy, mean-squared error, training time, etc.); a minimal comparison sketch is given below.
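
As a starting point (not a required approach), the sketch below compares three scikit-learn regressors on arrays X (relative poses) and y (forces and torques) such as those produced by the loading code above, reporting the mean-squared error and training time of each. The random placeholder data and the chosen models are illustrative only.

Python
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# X: relative marker poses, y: force/torque labels. Random placeholders here;
# replace with the arrays built from the raw dataset (e.g. relativePoses, force).
X = np.random.rand(500, 6)
y = np.random.rand(500, 6)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'LinearRegression': LinearRegression(),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=0),
    'MLP (2x50)': MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=1000, random_state=0),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f'{name}: MSE = {mse:.4f}, training time = {train_time:.2f} s')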

Report Template

Files Needed

  • The report should be in PDF format and renamed with your team number (example Team1-Lab1-Report.pdf)
  • Only the code files you use in Class 2 need to be submitted and named properly
  • All files are finally compressed into a zip file for submission and renamed with your team number (example Team1-Lab1.zip)

Deadline for Report Submission