<h1>File system permissions and paths in iOS</h1>
<p class="small">2021-02-20 · //navoshta.com/ios-file-system</p>
<p>Although Juno makes coding on iPad a breeze, there are still some tricks you need to know — one of them is working with the file system and handling paths. For example, when your code is supposed to read a file’s contents or write data to a file, how do you specify the file’s location in iOS?</p>
<h1>Jupyter client for iPad</h1>
<p class="small">2018-02-10 · //navoshta.com/juno</p>
<aside class="sidebar__right">
<nav class="toc">
<header><h4 class="nav__title"><i class="fa fa-none"></i> Contents</h4></header>
<ul class="toc__menu" id="markdown-toc">
<li><a href="#jupyter" id="markdown-toc-jupyter">Jupyter</a></li>
<li><a href="#backends" id="markdown-toc-backends">Backends</a></li>
<li><a href="#bundled-notebooks" id="markdown-toc-bundled-notebooks">Bundled Notebooks</a></li>
<li><a href="#interface" id="markdown-toc-interface">Interface</a></li>
</ul>
</nav>
</aside>
<p>I have been a huge fan of Jupyter for a while now, and most importantly of the flexibility it offers: I strongly believe that the fact that you only need a screen and a network connection to get access to pretty much unlimited computational resources has enormous potential.</p>
<p>That’s why I thought Jupyter was really missing a proper iPad client application with a native iOS interface that would let you connect to a remote backend and work with Jupyter on your iPad — and finally, after months of development and beta testing, my app <strong>Juno Connect</strong> has made it to the App Store!</p>
<p><strong>Juno Connect</strong> is a Jupyter Notebook client for iPad, which allows you to connect to an arbitrary remote Jupyter Notebook server and do pretty much everything you do in desktop Jupyter on your iPad. It supports a hardware keyboard, offers code completion driven by your server’s kernel, and has a beautiful touch-friendly interface that feels much more natural than trying to access Jupyter through your iPad’s Safari browser. In fact, some reviews suggest it’s easier to work with Jupyter in <strong>Juno Connect</strong> than on desktop! 😉</p>
<h1 id="jupyter">Jupyter</h1>
<p>I have covered Jupyter in my posts already: it’s an <a href="http://jupyter.org" target="_blank">interactive cloud computing environment</a> where you can combine code execution, Markdown, LaTeX, plots and rich media. It supports over 40 programming languages (including Python, R, Julia and Scala) and most big data and machine learning tools.</p>
<p>Now, the most beautiful part is that code execution is separated from the development environment, which means that whenever you hit “Run”, the hardware that actually executes your code and delivers the output can be anywhere reachable over a network interface. Essentially, this means that with Juno Connect you can use your iPad to run code on a powerful computing cluster on another continent, and still receive output and feedback (including code completion suggestions!) in real time. How awesome is that?</p>
<p><img src="/images/posts/juno/screenshot_h_01@2x.png" alt="image-center" class="align-center" /></p>
<p>I did realise, however, that Jupyter may not be the most user-friendly tool to work with, so I tried to make sure that Juno Connect provides the easiest entry point to using Jupyter with two things: backend integrations and bundled introductory notebooks.</p>
<h1 id="backends">Backends</h1>
<p>Jupyter can sometimes be tricky to set up for remote access. There are plenty of tutorials out there (including <a href="https://juno.sh/ssl-self-signed-cert/" target="_blank">mine</a> about configuring SSL), but some of them require additional knowledge of networking, command line interfaces and Unix systems. Luckily, there are cloud computing services that eliminate this by providing you with a remote Jupyter Notebook environment out of the box, such as <a href="https://notebooks.azure.com" target="_blank">Azure Notebooks</a> and <a href="https://cocalc.com/" target="_blank">CoCalc</a>. Both have free tiers, although CoCalc also offers paid plans with less restricted access and better hardware.</p>
<p>What you get is a virtual server running Jupyter Notebook that you can access from anywhere in a browser — or in Juno Connect as well! You can simply log in with your Microsoft or CoCalc account, access all your projects and libraries, and work with all your notebooks using Juno’s interface. It’s easiest to think of it as a special preconfigured server that simply provides a computational backend for Juno Connect.</p>
<p><img src="/images/posts/juno/screenshot_h_02@2x.png" alt="image-center" class="align-center" /></p>
<h1 id="bundled-notebooks">Bundled Notebooks</h1>
<p>Even setting up an account with a cloud computing service and trying to understand how Jupyter works can be a significant time investment for users not familiar with it. That’s why I have included a set of introductory notebooks that are available and runnable as soon as you download the app. They have plenty of sample code snippets and generated output (including stunning retina graphics), showing some of the amazing things you can do with Jupyter. These notebooks are launched on temporary servers individually for each user, so any changes you make will only appear for you, and will only persist until your server is restarted due to inactivity.</p>
<p>Under the hood, Juno Connect uses <a href="https://mybinder.org" target="_blank">Binder</a> to launch these notebooks. Binder is a service that turns any GitHub repo into a collection of interactive notebooks by launching a temporary server for it. It works amazingly well, and I am planning to introduce a deeper integration with it in Juno Connect, essentially allowing users to launch any GitHub repo as a server right in the app.</p>
<p><img src="/images/posts/juno/screenshot_h_06@2x.png" alt="image-center" class="align-center" /></p>
<h1 id="interface">Interface</h1>
<p>I have spent quite some time trying to make the user interface touch- and iPad-friendly. I believe users have certain expectations in terms of UI when working with an iPad app, and writing code is something that hadn’t been tackled too often in other apps up until this point. So this has been quite a challenge, but I’m pretty happy with how it turned out. It did take a couple of iterations (and a lot of feedback), but at least when it comes to notebook editing, the experience is much better now! What in my opinion makes Juno’s interface stand out is how it manages to declutter the navigation panel using context actions and menus.</p>
<figure class="third ">
<a href="//navoshta.com/images/posts/juno/screenshot_v_01.png">
<img src="//navoshta.com/images/posts/juno/screenshot_v_01.png" alt="" />
</a>
<a href="//navoshta.com/images/posts/juno/screenshot_v_02.png">
<img src="//navoshta.com/images/posts/juno/screenshot_v_02.png" alt="" />
</a>
<a href="//navoshta.com/images/posts/juno/screenshot_v_05.png">
<img src="//navoshta.com/images/posts/juno/screenshot_v_05.png" alt="" />
</a>
<a href="//navoshta.com/images/posts/juno/screenshot_v_06.png">
<img src="//navoshta.com/images/posts/juno/screenshot_v_06.png" alt="" />
</a>
<a href="//navoshta.com/images/posts/juno/screenshot_v_08.png">
<img src="//navoshta.com/images/posts/juno/screenshot_v_08.png" alt="" />
</a>
<a href="//navoshta.com/images/posts/juno/screenshot_v_09.png">
<img src="//navoshta.com/images/posts/juno/screenshot_v_09.png" alt="" />
</a>
</figure>
<p>I would like to take this opportunity to thank all the beta testers (more than 1200 of them!) who helped test the app and shared their feedback. Thank you once again, and I hope you will enjoy all the new things planned for <strong>Juno Connect</strong> in the coming year! Stay tuned. 😉</p>
<p align="center">
<a href="https://itunes.apple.com/app/juno-jupyter-notebook-client/id1315744137" target="_blank"><img src="/images/posts/juno/download_black.svg" style="height: 58px;" /></a>
</p>
<h1>Self-signed SSL certificate in Jupyter</h1>
<p class="small">2017-09-01 · //navoshta.com/jupyter-ssl-self-signed-cert</p>
<p>In order to use Jupyter Notebook on iPad, one needs to correctly configure SSL certificates. Since issuing a proper certificate from a trusted authority could be challenging in some cases, a self-signed certificate should suffice, provided it was signed by a CA that is trusted by the device. Follow these steps to get it working on your iPad!</p>
<h1>Visualizing lidar data</h1>
<p class="small">2017-05-26 · //navoshta.com/kitti-lidar</p>
<aside class="sidebar__right">
<nav class="toc">
<header><h4 class="nav__title"><i class="fa fa-none"></i> Contents</h4></header>
<ul class="toc__menu" id="markdown-toc">
<li><a href="#dataset" id="markdown-toc-dataset">Dataset</a></li>
<li><a href="#dependencies" id="markdown-toc-dependencies">Dependencies</a></li>
<li><a href="#visualization" id="markdown-toc-visualization">Visualization</a> <ul>
<li><a href="#cameras" id="markdown-toc-cameras">Cameras</a></li>
<li><a href="#lidar" id="markdown-toc-lidar">Lidar</a></li>
</ul>
</li>
</ul>
</nav>
</aside>
<p>Arguably the most essential piece of hardware for a self-driving car setup is a lidar. A <a href="https://en.wikipedia.org/wiki/Lidar" target="_blank">lidar</a> allows you to collect precise distances to nearby objects by continuously scanning the vehicle’s surroundings with a beam of laser light, and measuring how long it takes the reflected pulses to travel back to the sensor.</p>
<p>Although lidars used to be the most expensive components of self-driving cars, and could easily cost you as much as $75,000 just a couple of years ago, prices have plummeted recently and there are really good lidar sensors on the market in the sub-$8,000 range these days. And it just keeps getting better, as Velodyne has just <a href="http://www.businesswire.com/news/home/20170419005516/en/Velodyne-LiDAR-Announces-%E2%80%9CVelarray%E2%80%9D-LiDAR-Sensor">announced</a> a model range that is a whole order of magnitude cheaper, with a limited field-of-view and presumably costing just under $1,000.</p>
<h1 id="dataset">Dataset</h1>
<p>Luckily, you don’t have to spend that much money to get hold of data generated by a lidar. The <a href="http://www.cvlibs.net/datasets/kitti/">KITTI Vision Benchmark Suite</a> contains datasets collected with a car driving around rural areas of a city — a car equipped with a lidar and a bunch of cameras, of course. Some of those datasets are labeled, i.e. they also contain information about the objects around the car; we will visualize those as well. These datasets are <a href="http://www.cvlibs.net/datasets/kitti/raw_data.php">publicly available here</a>; if you would like to follow along, just go ahead and download one of them.</p>
<p>I will use the <code class="language-plaintext highlighter-rouge">2011_09_26_drive_0001</code> dataset and the corresponding tracklets, i.e. the labeled surrounding objects. It is one of the smallest datasets out there (0.4 GB), containing data for just 11 seconds of driving:</p>
<ul>
<li><strong>Length</strong>: 114 frames (00:11 minutes)</li>
<li><strong>Image resolution</strong>: <code class="language-plaintext highlighter-rouge">1392 x 512</code> pixels</li>
<li><strong>Labels</strong>: 12 Cars, 0 Vans, 0 Trucks, 0 Pedestrians, 0 Sitters, 2 Cyclists, 1 Tram, 0 Misc</li>
</ul>
<h1 id="dependencies">Dependencies</h1>
<p>A lidar operates by streaming a laser beam at high frequency, generating a 3D point cloud as an output in real time. We are going to use a couple of dependencies to work with the point cloud provided in the KITTI dataset: apart from the familiar toolset of <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">matplotlib</code> we will use <a href="https://github.com/utiasSTARS/pykitti"><code class="language-plaintext highlighter-rouge">pykitti</code></a>. In order to make the tracklet parsing math easier, we will use a couple of methods originally implemented by Christian Herdtweck that I have updated for Python 3; you can find them in <code class="language-plaintext highlighter-rouge">source/parseTrackletXML.py</code> in the project repo.</p>
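<p>As a starting point, here is a minimal sketch of loading the dataset with <code class="language-plaintext highlighter-rouge">pykitti</code>; the base directory is a placeholder for wherever you unpacked the KITTI archive.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pykitti

# Hypothetical base directory containing the unpacked 2011_09_26 archive
basedir = 'data'
dataset = pykitti.raw(basedir, '2011_09_26', '0001')

# Each velodyne frame is an Nx4 numpy array: x, y, z, reflectance
velo_frame = dataset.get_velo(0)
</code></pre></div></div>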
<h1 id="visualization">Visualization</h1>
<h2 id="cameras">Cameras</h2>
<p>In addition to the lidar 3D point cloud data, the KITTI dataset also contains video frames from a set of forward-facing cameras mounted on the vehicle. The regular camera data is not half as exciting as the lidar data, but it is still worth checking out.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/kitti-lidar/cameras.png" alt="image-center" class="align-center" />
Sample frames from cameras</p>
<p>Camera frames look pretty straightforward: you can see a tram track on the right with a lonely tram far ahead, and some parked cars on the left. Although those road features may seem obvious to you, a computer vision algorithm would struggle to differentiate them relying solely on visual data.</p>
<h2 id="lidar">Lidar</h2>
<p>The dataset in question contains 114 lidar point cloud frames over a duration of 11 seconds. This equates to approximately 10 frames per second, which is a very decent scanning rate, given that we get a 360° field-of-view with each frame containing approximately 120,000 points — a fair amount of data to stream in real time. To avoid cluttering the visualizations, we will randomly sample 20% of the points for each frame and discard the rest.</p>
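<p>The sampling itself is a one-liner with <code class="language-plaintext highlighter-rouge">numpy</code>; something along these lines, assuming <code class="language-plaintext highlighter-rouge">dataset</code> is the <code class="language-plaintext highlighter-rouge">pykitti</code> object from the snippet above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

frame_index = 0
points = dataset.get_velo(frame_index)
# Keep a random 20% of the points and discard the rest
keep = np.random.choice(points.shape[0], points.shape[0] // 5, replace=False)
sampled = points[keep]
</code></pre></div></div>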
<p>We will additionally visualize <em>tracklets</em>, i.e. labeled objects like cars, trams, pedestrians and so on. With a bit of math we will grab information from the KITTI tracklets file and work out each object’s bounding box for each frame; feel free to check out the <a href="https://github.com/alexstaravoitau/KITTI-Dataset/blob/master/kitti-dataset.ipynb">notebook</a> for more details. There are only 3 types of objects in this particular 11-second clip, and we will mark them with bounding boxes as follows: cars in <strong>blue</strong>, trams in <strong>red</strong> and cyclists in <strong>green</strong>. Let’s first visualize a sample lidar frame on a 3D plot.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/kitti-lidar/lidar_frame.png" alt="image-center" class="align-center" />
Sample lidar frame</p>
<p>Looks pretty neat! You can see the car with the lidar in the center of a black circle, with laser beams coming out of it. You can even see silhouettes of the cars parked on the left side of the road and the tram tracks on the right! And of course there are bounding boxes for the tram and the cars; they are exactly where you would expect them based on the regular camera data. You might have also noticed that only the objects that are visible to the cameras are labeled.</p>
<p>Having this data as a point cloud is extremely useful, as it can be represented in various ways specific to particular applications. You could scale the data points over some particular axis, or simply discard one of the axes to create a plane projection of the point cloud. This is what this Velodyne frame looks like when projected on the <code class="language-plaintext highlighter-rouge">XZ</code>, <code class="language-plaintext highlighter-rouge">XY</code> and <code class="language-plaintext highlighter-rouge">YZ</code> planes respectively:</p>
<p style="text-align: center;" class="small"><img src="/images/posts/kitti-lidar/lidar_frame_projections.png" alt="image-center" class="align-center" />
Projections of a sample lidar frame</p>
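<p>Producing these projections is as simple as picking two of the three coordinate columns; a minimal sketch with <code class="language-plaintext highlighter-rouge">matplotlib</code>, assuming <code class="language-plaintext highlighter-rouge">sampled</code> is the subsampled point cloud from the earlier snippet.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt

# Project the point cloud on the XZ, XY and YZ planes by dropping one axis at a time
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (i, j), title in zip(axes, ((0, 2), (0, 1), (1, 2)), ('XZ', 'XY', 'YZ')):
    ax.scatter(sampled[:, i], sampled[:, j], s=0.1, c='black')
    ax.set_title('{} projection'.format(title))
plt.show()
</code></pre></div></div>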
<p>Usually you can significantly improve your model’s performance by preprocessing the data. What you are trying to achieve is a reduction in the dimensionality of the input, hoping to extract useful features and remove those that would be redundant or would slow down and confuse the model. In this particular case discarding the <code class="language-plaintext highlighter-rouge">Z</code> coordinate seems like a promising path to explore, as it gives us pretty much a bird’s-eye view of the vehicle’s surroundings. With more sophisticated feature engineering, coupled with regular camera data as an additional input, you could achieve decent performance in detecting and classifying surrounding objects.</p>
<p>Finally, let’s plot all 114 sequential frames and combine them into a short video representing how the point cloud changes over time.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/kitti-lidar/pcl_data.gif" alt="image-center" class="align-center" />
Lidar data plotted over time</p>
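<p>For reference, this is roughly how such an animation could be put together with <code class="language-plaintext highlighter-rouge">matplotlib</code>; a sketch assuming the <code class="language-plaintext highlighter-rouge">dataset</code> object from before, with the frame rate and the ImageMagick writer being assumptions rather than the notebook’s exact setup.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

def draw_frame(i):
    # Re-draw the subsampled point cloud for frame i
    ax.clear()
    points = dataset.get_velo(i)
    keep = np.random.choice(points.shape[0], points.shape[0] // 5, replace=False)
    points = points[keep]
    ax.scatter(points[:, 0], points[:, 1], points[:, 2], s=0.1, c='black')

animation = FuncAnimation(fig, draw_frame, frames=114)
animation.save('pcl_data.gif', writer='imagemagick', fps=10)
</code></pre></div></div>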
<p>This should give a much better idea of what lidar data looks like. You can clearly see silhouettes of trees and parked cars that our vehicle is passing by — now <em>that</em> would be much easier for an algorithm to interpret. And although lidar is usually used in conjunction with a bunch of other sensors and data sources, it plays a significant role in vehicle <a href="https://en.wikipedia.org/wiki/Simultaneous_localization_and_mapping">simultaneous localization and mapping</a>.</p>
<p>The full implementation is available in the <a href="https://github.com/alexstaravoitau/KITTI-Dataset" target="_blank">KITTI-Dataset</a> repository on GitHub.</p>
<h1>Detecting road features</h1>
<p class="small">2017-03-06 · //navoshta.com/detecting-road-features</p>
<aside class="sidebar__right">
<nav class="toc">
<header><h4 class="nav__title"><i class="fa fa-none"></i> Contents</h4></header>
<ul class="toc__menu" id="markdown-toc">
<li><a href="#source-video" id="markdown-toc-source-video">Source video</a></li>
<li><a href="#lane-tracking" id="markdown-toc-lane-tracking">Lane Tracking</a> <ul>
<li><a href="#camera-calibration" id="markdown-toc-camera-calibration">Camera calibration</a></li>
<li><a href="#edge-detection" id="markdown-toc-edge-detection">Edge detection</a></li>
<li><a href="#perspective-transform" id="markdown-toc-perspective-transform">Perspective transform</a></li>
<li><a href="#detect-boundaries" id="markdown-toc-detect-boundaries">Detect boundaries</a></li>
<li><a href="#approximate-properties" id="markdown-toc-approximate-properties">Approximate properties</a></li>
<li><a href="#sequence-of-frames" id="markdown-toc-sequence-of-frames">Sequence of frames</a></li>
</ul>
</li>
<li><a href="#vehicle-tracking" id="markdown-toc-vehicle-tracking">Vehicle Tracking</a> <ul>
<li><a href="#feature-extraction" id="markdown-toc-feature-extraction">Feature extraction</a></li>
<li><a href="#training-a-classifier" id="markdown-toc-training-a-classifier">Training a classifier</a></li>
<li><a href="#frame-segmentation" id="markdown-toc-frame-segmentation">Frame segmentation</a></li>
<li><a href="#merging-multiple-detections" id="markdown-toc-merging-multiple-detections">Merging multiple detections</a></li>
<li><a href="#sequence-of-frames-1" id="markdown-toc-sequence-of-frames-1">Sequence of frames</a></li>
</ul>
</li>
<li><a href="#results" id="markdown-toc-results">Results</a></li>
</ul>
</nav>
</aside>
<p>We are going to try detecting and tracking some basic road features in a video stream from a front-facing camera on a vehicle. This is clearly a very naive approach that can hardly be applied in the field; however, it is a good demonstration of what we <em>can</em> detect using mainly computer vision techniques, i.e. fiddling with color spaces and various filters. We will cover tracking of the following features:</p>
<ul>
<li><strong>Lane boundaries.</strong> Understanding where the lane is could be useful in many applications, be it a self-driving car or some driving assistant software.</li>
<li><strong>Surrounding vehicles.</strong> Keeping track of other vehicles around you is just as important, say, if you were to implement a collision-avoidance algorithm.</li>
</ul>
<p>We will implement it in two major steps: first we will prepare a pipeline for lane tracking, and then learn how to detect surrounding vehicles.</p>
<p class="notice">Road features detection is one of the assignments in <a href="http://udacity.com/drive"><strong>Udacity Self-Driving Car Nanodegree</strong></a> program, however the concepts described here should be easy to follow even without that context.</p>
<h1 id="source-video">Source video</h1>
<p>I am going to use a short video clip shot from a vehicle’s front-facing camera while driving on a highway. It was shot in close to perfect conditions: sunny weather, not many vehicles around, road markings clearly visible, etc. — so computer vision techniques alone should be sufficient for a quick demonstration. You can check out the full <a href="https://github.com/alexstaravoitau/advanced-lane-finding/blob/master/data/video/project_video.mp4" target="_blank">50-second video here</a>.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/project_source_video_sample.gif" alt="image-center" class="align-center" />
Source video</p>
<h1 id="lane-tracking">Lane Tracking</h1>
<p>Let’s first prepare a processing pipeline to identify the lane boundaries in a video. The pipeline includes the following steps that we apply to each frame:</p>
<ul>
<li><strong>Camera calibration.</strong> To cater for inevitable camera distortions, we calculate camera calibration using a set of calibration chessboard images and apply a correction to each of the frames.</li>
<li><strong>Edge detection with gradient and color thresholds.</strong> We then use a bunch of metrics based on gradients and color information to highlight edges in the frame.</li>
<li><strong>Perspective transformation.</strong> To make lane boundary extraction easier we apply a perspective transformation, resulting in something similar to a bird’s eye view of the road ahead of the vehicle.</li>
<li><strong>Fitting boundary lines.</strong> We then scan the resulting frame for pixels that could belong to lane boundaries and try to fit lines to those pixels.</li>
<li><strong>Approximate road properties and vehicle position.</strong> We also provide a rough estimate of road curvature and the vehicle’s position within the lane using known road dimensions.</li>
</ul>
<h2 id="camera-calibration">Camera calibration</h2>
<p>We are going to use some heavy image warping at later stages, which would make any distortions introduced by the camera lens very apparent. So in order to cater for that, we will introduce a camera correction step based on a set of calibration images shot with the same camera. A very common technique is shooting a printed chessboard from various angles and calculating the distortions introduced by the camera based on the expected chessboard orientation in each photo.</p>
<p>We are going to use a number of OpenCV routines in order to apply correction for camera distortion. I first prepare a <code class="language-plaintext highlighter-rouge">pattern</code> variable holding <em>object points</em> in <code class="language-plaintext highlighter-rouge">(x, y, z)</code> coordinate space of the chessboard, which are essentially inner corners of the chessboard. Here <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are horizontal and vertical indices of the chessboard squares, and <code class="language-plaintext highlighter-rouge">z</code> is always <code class="language-plaintext highlighter-rouge">0</code> (as chessboard inner corners lie in the same plane). Those <em>object points</em> are going to be the same for each calibration image, as we expect the same chessboard in each.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pattern</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">pattern_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">pattern_size</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">3</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="n">pattern</span><span class="p">[:,</span> <span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mgrid</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">pattern_size</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">0</span><span class="p">:</span><span class="n">pattern_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]].</span><span class="n">T</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<p>We then use <code class="language-plaintext highlighter-rouge">cv2.findChessboardCorners()</code> function to get coordinates of the corresponding corners in each calibration image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pattern_points</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">image_points</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">found</span><span class="p">,</span> <span class="n">corners</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">findChessboardCorners</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="bp">None</span><span class="p">)</span>
<span class="k">if</span> <span class="n">found</span><span class="p">:</span>
<span class="n">pattern_points</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pattern</span><span class="p">)</span>
<span class="n">image_points</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">corners</span><span class="p">)</span>
</code></pre></div></div>
<p>Once we have collected all the points from each image, we can compute the camera calibration matrix and distortion coefficients using the <code class="language-plaintext highlighter-rouge">cv2.calibrateCamera()</code> function.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">_</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">camera_matrix</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">dist_coefficients</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">calibrateCamera</span><span class="p">(</span>
<span class="n">pattern_points</span><span class="p">,</span> <span class="n">image_points</span><span class="p">,</span> <span class="p">(</span><span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Now that we have camera calibration matrix and distortion coefficients we can use <code class="language-plaintext highlighter-rouge">cv2.undistort()</code> to apply camera distortion correction to any image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">corrected_image</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">undistort</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">camera_matrix</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">dist_coefficients</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">camera_matrix</span><span class="p">)</span>
</code></pre></div></div>
<p>As some of the calibration images did not have the chessboard fully visible, we will use one of those for verifying the aforementioned calibration pipeline.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/calibration_1.png" alt="image-center" class="align-center" />
Original vs. calibrated images</p>
<p class="notice">For implementation details check <code class="language-plaintext highlighter-rouge">CameraCalibration</code> class in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/lanetracker/camera.py" target="_blank"><code class="language-plaintext highlighter-rouge">lanetracker/camera.py</code></a>.</p>
<h2 id="edge-detection">Edge detection</h2>
<p>We use a set of gradient- and color-based thresholds to detect edges in the frame. For gradients we use the <a href="https://en.wikipedia.org/wiki/Sobel_operator" target="_blank">Sobel operator</a>, which essentially highlights rapid changes in color over either of the two axes by approximating derivatives with a simple convolution kernel. For color we simply convert the frame to the <a href="https://en.wikipedia.org/wiki/HSL_and_HSV" target="_blank"><strong>HLS</strong> color space</a> and apply a threshold on the S channel. The reason we use HLS is that its saturation channel proved to perform best in separating light pixels (road markings) from dark pixels (road).</p>
<ul>
<li><strong>Gradient absolute value</strong>. For the absolute gradient value we simply apply a threshold to the <code class="language-plaintext highlighter-rouge">cv2.Sobel()</code> output for each axis.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sobel</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">absolute</span><span class="p">(</span><span class="n">cv2</span><span class="p">.</span><span class="n">Sobel</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CV_64F</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">ksize</span><span class="o">=</span><span class="mi">3</span><span class="p">))</span>
</code></pre></div></div>
<ul>
<li><strong>Gradient magnitude</strong>. Additionally, we include pixels within a threshold of the gradient magnitude.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sobel_x</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">Sobel</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CV_64F</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">ksize</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">sobel_y</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">Sobel</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CV_64F</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">ksize</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">magnitude</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">sobel_x</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">sobel_y</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<ul>
<li><strong>Gradient direction</strong>. We also include pixels that happen to be within a threshold of the gradient direction.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sobel_x</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">Sobel</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CV_64F</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">ksize</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">sobel_y</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">Sobel</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">CV_64F</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">ksize</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">direction</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arctan2</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">absolute</span><span class="p">(</span><span class="n">sobel_y</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">absolute</span><span class="p">(</span><span class="n">sobel_x</span><span class="p">))</span>
</code></pre></div></div>
<ul>
<li><strong>Color</strong>. Finally, we extract the S channel of the image’s HLS representation and apply a threshold to its absolute value.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hls</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">cvtColor</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">image</span><span class="p">),</span> <span class="n">cv2</span><span class="p">.</span><span class="n">COLOR_RGB2HLS</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">float</span><span class="p">)</span>
<span class="n">s_channel</span> <span class="o">=</span> <span class="n">hls</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">2</span><span class="p">]</span>
</code></pre></div></div>
<p>We apply a combination of all of these filters as our edge detection pipeline.</p>
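<p>Put together, the pipeline boils down to OR-ing the individual binary masks; a minimal sketch, where the threshold values are illustrative rather than the exact ones used in the project, and <code class="language-plaintext highlighter-rouge">sobel</code> and <code class="language-plaintext highlighter-rouge">s_channel</code> are the arrays computed above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Binary masks for each criterion (threshold values are examples)
gradient_mask = np.zeros_like(s_channel)
gradient_mask[(sobel >= 20) &amp; (sobel &lt;= 100)] = 1
color_mask = np.zeros_like(s_channel)
color_mask[(s_channel >= 170) &amp; (s_channel &lt;= 255)] = 1

# A pixel is an edge if either of the masks fires
edges = np.zeros_like(s_channel)
edges[(gradient_mask == 1) | (color_mask == 1)] = 1
</code></pre></div></div>
<p>Here is an example of its output, where pixels masked by color are blue, and pixels masked by gradient are green.</p>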
<p align="center">
<a href="/images/posts/detecting-road-features/edges.jpg"><img src="/images/posts/detecting-road-features/edges.jpg" /></a>
</p>
<p style="text-align: center;" class="small">Original vs. highlighted edges</p>
<p class="notice">For implementation details check functions in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/lanetracker/gradients.py" target="_blank"><code class="language-plaintext highlighter-rouge">lanetracker/gradients.py</code></a>.</p>
<h2 id="perspective-transform">Perspective transform</h2>
<p>It would be much easier to detect lane boundaries if we could get hold of a bird’s eye view of the road, and we can get something fairly close to it by applying a perspective transform to the camera frames. For the sake of this demo project I manually pinpointed source and destination points in the camera frames, so the perspective transform simply maps the following coordinates.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Source</th>
<th style="text-align: center">Destination</th>
<th style="text-align: center">Position</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">(564, 450)</code></td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">(100, 0)</code></td>
<td style="text-align: center">Top left corner.</td>
</tr>
<tr>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">(716, 450)</code></td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">(1180, 0)</code></td>
<td style="text-align: center">Top right corner.</td>
</tr>
<tr>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">(-100, 720)</code></td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">(100, 720)</code></td>
<td style="text-align: center">Bottom left corner.</td>
</tr>
<tr>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">(1380, 720)</code></td>
<td style="text-align: center"><code class="language-plaintext highlighter-rouge">(1180, 720)</code></td>
<td style="text-align: center">Bottom right corner.</td>
</tr>
</tbody>
</table>
<p>The transformation matrix is computed with the <code class="language-plaintext highlighter-rouge">cv2.getPerspectiveTransform()</code> function and applied with <code class="language-plaintext highlighter-rouge">cv2.warpPerspective()</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">source</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">([[</span><span class="n">w</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">76</span><span class="p">,</span> <span class="n">h</span> <span class="o">*</span> <span class="p">.</span><span class="mi">625</span><span class="p">],</span> <span class="p">[</span><span class="n">w</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">+</span> <span class="mi">76</span><span class="p">,</span> <span class="n">h</span> <span class="o">*</span> <span class="p">.</span><span class="mi">625</span><span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mi">100</span><span class="p">,</span> <span class="n">h</span><span class="p">],</span> <span class="p">[</span><span class="n">w</span> <span class="o">+</span> <span class="mi">100</span><span class="p">,</span> <span class="n">h</span><span class="p">]])</span>
<span class="n">destination</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">([[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="n">w</span> <span class="o">-</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="n">h</span><span class="p">],</span> <span class="p">[</span><span class="n">w</span> <span class="o">-</span> <span class="mi">100</span><span class="p">,</span> <span class="n">h</span><span class="p">]])</span>
<span class="n">transform_matrix</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">getPerspectiveTransform</span><span class="p">(</span><span class="n">source</span><span class="p">,</span> <span class="n">destination</span><span class="p">)</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">warpPerspective</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">transform_matrix</span><span class="p">,</span> <span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">))</span>
</code></pre></div></div>
<p>This is what it looks like for an arbitrary test image.</p>
<p align="center">
<a href="/images/posts/detecting-road-features/perspective.jpg"><img src="/images/posts/detecting-road-features/perspective.jpg" /></a>
</p>
<p style="text-align: center;" class="small">Original vs. bird’s eye view</p>
<p class="notice">For implementation details check functions in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/lanetracker/perspective.py" target="_blank"><code class="language-plaintext highlighter-rouge">lanetracker/perspective.py</code></a>.</p>
<h2 id="detect-boundaries">Detect boundaries</h2>
<p>We are now going to scan the resulting frame from bottom to top, trying to isolate pixels that could represent lane boundaries. What we are trying to detect is two lines (each represented by a <code class="language-plaintext highlighter-rouge">Line</code> class) that make up the lane boundaries. For each of those lines we have a set of <em>windows</em> (represented by a <code class="language-plaintext highlighter-rouge">Window</code> class). We scan the frame with those windows, collecting non-zero pixels within the window bounds. Once we reach the top, we try to fit a second order polynomial to the collected points. These polynomial coefficients then represent a single lane boundary.</p>
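<p>Stripped of bookkeeping, the scan looks roughly like this; a simplified sketch rather than the exact <code class="language-plaintext highlighter-rouge">LaneTracker</code> implementation, assuming <code class="language-plaintext highlighter-rouge">frame</code> is the binary warped edge image, with the window count, margin and starting position picked as illustrative values.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

nonzero_y, nonzero_x = frame.nonzero()   # frame is the binary warped edge image
window_height = frame.shape[0] // 9
lane_x, lane_y = [], []
x_center = 300                           # e.g. a histogram peak in the bottom half

for i in range(9):
    y_top = frame.shape[0] - (i + 1) * window_height
    y_bottom = frame.shape[0] - i * window_height
    # Collect non-zero pixels that fall within the current window
    within = ((nonzero_y >= y_top) &amp; (nonzero_y &lt; y_bottom) &amp;
              (nonzero_x >= x_center - 100) &amp; (nonzero_x &lt; x_center + 100))
    lane_x.append(nonzero_x[within])
    lane_y.append(nonzero_y[within])
    if within.sum() > 50:                # re-center the next window on detected pixels
        x_center = int(nonzero_x[within].mean())

# Fit a second order polynomial to the collected points
coefficients = np.polyfit(np.concatenate(lane_y), np.concatenate(lane_x), 2)
</code></pre></div></div>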
<p>Here is a debug image representing the process. On the left is the <em>original</em> image after we apply camera calibration and perspective transform. On the right is the same image, but with edges highlighted in <strong><span style="color: green">green</span></strong> and <strong><span style="color: blue">blue</span></strong>, scanning window boundaries highlighted in <strong><span style="color: yellow">yellow</span></strong>, and a second order polynomial approximation of the collected points in <strong><span style="color: red">red</span></strong>.</p>
<p align="center">
<a href="/images/posts/detecting-road-features/detection.jpg"><img src="/images/posts/detecting-road-features/detection.jpg" /></a>
</p>
<p style="text-align: center;" class="small">Boundary detection pipeline</p>
<p class="notice">For implementation details check <code class="language-plaintext highlighter-rouge">LaneTracker</code> class in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/lanetracker/perspective.py" target="_blank"><code class="language-plaintext highlighter-rouge">lanetracker/tracker.py</code></a>, <code class="language-plaintext highlighter-rouge">Window</code> class in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/lanetracker/perspective.py" target="_blank"><code class="language-plaintext highlighter-rouge">lanetracker/window.py</code></a> and <code class="language-plaintext highlighter-rouge">Line</code> class in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/lanetracker/perspective.py" target="_blank"><code class="language-plaintext highlighter-rouge">lanetracker/line.py</code></a>.</p>
<h2 id="approximate-properties">Approximate properties</h2>
<p>We can now approximate some of the road properties and the vehicle’s spatial position using known real-world dimensions. Here we assume that the visible vertical part of the bird’s eye view warped frame is <strong>27 meters</strong>, based on the known length of the dashed lines on American roads. We also assume that the lane width is around <strong>3.7 meters</strong>, again based on American regulations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ym_per_pix</span> <span class="o">=</span> <span class="mi">27</span> <span class="o">/</span> <span class="mi">720</span> <span class="c1"># meters per pixel in y dimension
</span><span class="n">xm_per_pix</span> <span class="o">=</span> <span class="mf">3.7</span> <span class="o">/</span> <span class="mi">700</span> <span class="c1"># meters per pixel in x dimension
</span></code></pre></div></div>
<h3 class="no_toc" id="road-curvature">Road curvature</h3>
<p>Previously we approximated each lane boundary as a second order polynomial curve, which can be represented with the following equation.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/poly_2.png" alt="image-center" class="align-center" height="80px" width="295px" />
Second order polynomial</p>
<p>As per <a href="http://www.intmath.com/applications-differentiation/8-radius-curvature.php" target="_blank">this tutorial</a>, we can get the radius of curvature in an arbitrary point using the following equation.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/curve_grad.png" alt="image-center" class="align-center" height="80px" width="295px" />
Radius equation</p>
<p>If we calculate actual derivatives of the second order polynomial, we get the following.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/curve_coef.png" alt="image-center" class="align-center" height="80px" width="295px" />
Radius equation with substituted derivatives</p>
<p>Therefore, given that the <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> variables contain coordinates of the points making up the curve, we can get the curvature radius as follows.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Fit a new polynomial in real world coordinate space
</span><span class="n">poly_coef</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">polyfit</span><span class="p">(</span><span class="n">y</span> <span class="o">*</span> <span class="n">ym_per_pix</span><span class="p">,</span> <span class="n">x</span> <span class="o">*</span> <span class="n">xm_per_pix</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">radius</span> <span class="o">=</span> <span class="p">((</span><span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">poly_coef</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="mi">720</span> <span class="o">*</span> <span class="n">ym_per_pix</span> <span class="o">+</span> <span class="n">poly_coef</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">**</span> <span class="mf">1.5</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">absolute</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">poly_coef</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</code></pre></div></div>
<h3 class="no_toc" id="vehicle-position">Vehicle position</h3>
<p>We can also approximate the vehicle’s position within the lane. This routine calculates an approximate distance to a curve at the bottom of the frame, given that <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> contain coordinates of the points making up the curve.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">_</span><span class="p">)</span> <span class="o">=</span> <span class="n">frame</span><span class="p">.</span><span class="n">shape</span>
<span class="n">distance</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">absolute</span><span class="p">((</span><span class="n">w</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">)])</span> <span class="o">*</span> <span class="n">xm_per_pix</span><span class="p">)</span>
</code></pre></div></div>
<p class="notice">For implementation details check <code class="language-plaintext highlighter-rouge">Line</code> class in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/lanetracker/perspective.py" target="_blank"><code class="language-plaintext highlighter-rouge">lanetracker/line.py</code></a>.</p>
<h2 id="sequence-of-frames">Sequence of frames</h2>
<p>We can now apply the whole pipeline to a sequence of frames. We will use an approximation of the lane boundaries detected over the last 5 frames in the video using a <code class="language-plaintext highlighter-rouge">deque</code> collection type, which makes sure we only store the last 5 boundary approximations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">deque</span>
<span class="n">coefficients</span> <span class="o">=</span> <span class="n">deque</span><span class="p">(</span><span class="n">maxlen</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<p>We then check if we detected enough points (<code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> arrays of coordinates) in the current frame to approximate a line, and append the polynomial coefficients to <code class="language-plaintext highlighter-rouge">coefficients</code>. The sanity check here is to ensure that the detected points span a large enough portion of the image height; otherwise we wouldn’t be able to get a reasonable line approximation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">></span> <span class="n">h</span> <span class="o">*</span> <span class="p">.</span><span class="mi">625</span><span class="p">:</span>
<span class="n">coefficients</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">polyfit</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
</code></pre></div></div>
<p>Whenever we want to draw a line, we get an average of the polynomial coefficients detected over the last 5 frames.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mean_coefficients</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">coefficients</span><span class="p">).</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p>This approach proved itself to work reasonably well; you can check out the <a href="https://github.com/alexstaravoitau/advanced-lane-finding/blob/master/data/video/project_video_annotated_lane.mp4" target="_blank">full annotated video here</a>.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/project_video_sample.gif" alt="image-center" class="align-center" />
Sample of the annotated project video</p>
<p class="notice">For implementation details check <code class="language-plaintext highlighter-rouge">LaneTracker</code> class in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/lanetracker/tracker.py" target="_blank"><code class="language-plaintext highlighter-rouge">lanetracker/tracker.py</code></a>.</p>
<h1 id="vehicle-tracking">Vehicle Tracking</h1>
<p>We are going to use a bit of machine learning to detect vehicle presence in an image by training a classifier that labels an image as either containing or not containing a vehicle. We will train this classifier using a dataset provided by Udacity, which comes in two separate archives: <a href="https://s3.amazonaws.com/udacity-sdc/Vehicle_Tracking/vehicles.zip" target="_blank">images containing cars</a> and <a href="https://s3.amazonaws.com/udacity-sdc/Vehicle_Tracking/non-vehicles.zip" target="_blank">images not containing cars</a>. The dataset contains <strong>17,760</strong> color RGB images of <strong>64×64 px</strong> each, with <strong>8,792</strong> samples labeled as containing <strong>vehicles</strong> and <strong>8,968</strong> samples labeled as <strong>non-vehicles</strong>.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/cars.png" alt="image-center" class="align-center" />
Random sample labeled as containing cars</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/non-cars.png" alt="image-center" class="align-center" />
Random sample of non-cars</p>
<p>In order to prepare a processing pipeline to identify surrounding vehicles, we are going to break it down into the following steps:</p>
<ul>
<li><strong>Extract features and train a classifier.</strong> We need to identify features that would be useful for vehicle detection and prepare a feature extraction pipeline. We then use it to train a classifier to detect a car in an individual frame segment.</li>
<li><strong>Apply frame segmentation.</strong> We then segment the frame into <em>windows</em> of various sizes that we run through the aforementioned classifier.</li>
<li><strong>Merge individual segment detections.</strong> As there will inevitably be multiple detections, we merge them together using a heat map, which should also help reduce the number of false positives.</li>
</ul>
<h2 id="feature-extraction">Feature extraction</h2>
<p>After experimenting with various features I settled on a combination of <strong>HOG</strong> (<a href="https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients" target="_blank">Histogram of Oriented Gradients</a>), <strong>spatial information</strong> and <strong>color channel histograms</strong>, all using the <a href="https://en.wikipedia.org/wiki/YCbCr" target="_blank"><strong>YCbCr</strong> color space</a>. Feature extraction is implemented as a context-preserving class (<code class="language-plaintext highlighter-rouge">FeatureExtractor</code>) to allow some pre-calculations for each frame. As some features take a lot of time to compute (looking at you, HOG), we only do that once for the entire image and then return regions of it.</p>
<h3 class="no_toc" id="histogram-of-oriented-gradients">Histogram of Oriented Gradients</h3>
<p>I had to run a bunch of experiments to come up with the final parameters, eventually settling on <strong>HOG</strong> with <strong>10 orientations</strong>, <strong>8 pixels per cell</strong> and <strong>2 cells per block</strong>. The experiments went as follows:</p>
<ol>
<li>Train and evaluate the classifier for a wide range of parameters and identify promising smaller ranges.</li>
<li>Train and evaluate the classifier on those smaller ranges of parameters multiple times for each experiment and assess average accuracy (roughly sketched below).</li>
</ol>
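<p>The second stage could be sketched roughly as follows. Note that <code class="language-plaintext highlighter-rouge">extract_features()</code> and <code class="language-plaintext highlighter-rouge">train_and_score()</code> are hypothetical helpers standing in for the dataset preparation and classifier training code from the project notebook.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Hypothetical helpers: extract_features() builds feature vectors and labels
# for the given HOG parameters, train_and_score() trains a fresh classifier
# and returns its validation accuracy.
for orientations in (9, 10, 11):
    accuracies = []
    for _ in range(5):
        X, y = extract_features(orientations=orientations,
                                pixels_per_cell=8, cells_per_block=2)
        accuracies.append(train_and_score(X, y))
    print(orientations, np.mean(accuracies))
</code></pre></div></div>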
<p>The winning combination turned out to be the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> orient px/cell clls/blck feat-s iter acc sec/test
10 8 2 5880 0 0.982 0.01408
10 8 2 5880 1 0.9854 0.01405
10 8 2 5880 2 0.9834 0.01415
10 8 2 5880 3 0.9825 0.01412
10 8 2 5880 4 0.9834 0.01413
Average accuracy = 0.98334
</code></pre></div></div>
<p>This is what the Histogram of Oriented Gradients looks like when applied to a random dataset sample.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/original.png" alt="image-center" class="align-center" />
Original (Y channel of YCbCr color space)</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/hog.png" alt="image-center" class="align-center" />
HOG (Histogram of Oriented Gradients)</p>
<p>The initial HOG calculation for the entire image is done using the <code class="language-plaintext highlighter-rouge">hog()</code> function from the <code class="language-plaintext highlighter-rouge">skimage.feature</code> module. We concatenate HOG features for all color channels.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="o">=</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span>
<span class="n">hog_features</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">channel</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">d</span><span class="p">):</span>
<span class="n">hog_features</span><span class="p">.</span><span class="n">append</span><span class="p">(</span>
<span class="n">hog</span><span class="p">(</span>
<span class="n">image</span><span class="p">[:,</span> <span class="p">:,</span> <span class="n">channel</span><span class="p">],</span>
<span class="n">orientations</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
<span class="n">pixels_per_cell</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span>
<span class="n">cells_per_block</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="n">transform_sqrt</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">visualise</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">feature_vector</span><span class="o">=</span><span class="bp">False</span>
<span class="p">)</span>
<span class="p">)</span>
<span class="n">hog_features</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">hog_features</span><span class="p">)</span>
</code></pre></div></div>
<p>This allows us to get features for an individual image window by calculating HOG array offsets, given that <code class="language-plaintext highlighter-rouge">x</code> is the window horizontal offset, <code class="language-plaintext highlighter-rouge">y</code> is the vertical offset and <code class="language-plaintext highlighter-rouge">k</code> is the size of the window (single value, side of a square region).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hog_k</span> <span class="o">=</span> <span class="p">(</span><span class="n">k</span> <span class="o">//</span> <span class="mi">8</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">hog_x</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">x</span> <span class="o">//</span> <span class="mi">8</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">hog_x</span> <span class="o">=</span> <span class="n">hog_features</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">-</span> <span class="n">hog_k</span> <span class="k">if</span> <span class="n">hog_x</span> <span class="o">+</span> <span class="n">hog_k</span> <span class="o">></span> <span class="n">hog_features</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="k">else</span> <span class="n">hog_x</span>
<span class="n">hog_y</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">y</span> <span class="o">//</span> <span class="mi">8</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">hog_y</span> <span class="o">=</span> <span class="n">hog_features</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">hog_k</span> <span class="k">if</span> <span class="n">hog_y</span> <span class="o">+</span> <span class="n">hog_k</span> <span class="o">></span> <span class="n">hog_features</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">else</span> <span class="n">hog_y</span>
<span class="n">region_hog</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ravel</span><span class="p">(</span><span class="n">hog_features</span><span class="p">[:,</span> <span class="n">hog_y</span><span class="p">:</span><span class="n">hog_y</span><span class="o">+</span><span class="n">hog_k</span><span class="p">,</span> <span class="n">hog_x</span><span class="p">:</span><span class="n">hog_x</span><span class="o">+</span><span class="n">hog_k</span><span class="p">,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="p">:])</span>
</code></pre></div></div>
<h3 class="no_toc" id="spatial-information">Spatial information</h3>
<p>For spatial information we simply resize the image to 16×16 and flatten it to a 1-D vector.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">spatial</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">16</span><span class="p">)).</span><span class="n">ravel</span><span class="p">()</span>
</code></pre></div></div>
<h3 class="no_toc" id="color-channel-histogram">Color channel histogram</h3>
<p>We additionally use individual color channel histogram information, breaking each channel into <strong>16 bins</strong> within the <strong>(0, 256)</strong> range.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">color_hist</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">((</span>
<span class="n">np</span><span class="p">.</span><span class="n">histogram</span><span class="p">(</span><span class="n">image</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="nb">range</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">256</span><span class="p">))[</span><span class="mi">0</span><span class="p">],</span>
<span class="n">np</span><span class="p">.</span><span class="n">histogram</span><span class="p">(</span><span class="n">image</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="nb">range</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">256</span><span class="p">))[</span><span class="mi">0</span><span class="p">],</span>
<span class="n">np</span><span class="p">.</span><span class="n">histogram</span><span class="p">(</span><span class="n">image</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">2</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="nb">range</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">256</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
<span class="p">))</span>
</code></pre></div></div>
<h3 class="no_toc" id="featureextractor"><code class="language-plaintext highlighter-rouge">FeatureExtractor</code></h3>
<p>The way the <code class="language-plaintext highlighter-rouge">FeatureExtractor</code> class works is that you initialise it with a single frame and then request feature vectors for individual regions; this way the computationally expensive features are only calculated once. You then call the <code class="language-plaintext highlighter-rouge">feature_vector()</code> method to get a concatenated combination of the HOG, spatial and color histogram feature vectors.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">extractor</span> <span class="o">=</span> <span class="n">FeatureExtractor</span><span class="p">(</span><span class="n">frame</span><span class="p">)</span>
<span class="c1"># Feature vector for entire frame
</span><span class="n">feature_vector</span> <span class="o">=</span> <span class="n">extractor</span><span class="p">.</span><span class="n">feature_vector</span><span class="p">()</span>
<span class="c1"># Feature vector for a 64×64 frame region at (0, 0) point
</span><span class="n">feature_vector</span> <span class="o">=</span> <span class="n">extractor</span><span class="p">.</span><span class="n">feature_vector</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
</code></pre></div></div>
<p class="notice">For implementation details check <code class="language-plaintext highlighter-rouge">FeatureExtractor</code> class in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/vehicletracker/features.py" target="_blank"><code class="language-plaintext highlighter-rouge">vehicletracker/features.py</code></a>.</p>
<h2 id="training-a-classifier">Training a classifier</h2>
<p>I trained a Linear SVC (the <code class="language-plaintext highlighter-rouge">sklearn</code> implementation) using the feature extractor described above. Nothing fancy here: I used <code class="language-plaintext highlighter-rouge">sklearn</code>’s <code class="language-plaintext highlighter-rouge">train_test_split</code> to split the dataset into training and validation sets, and <code class="language-plaintext highlighter-rouge">sklearn</code>’s <code class="language-plaintext highlighter-rouge">StandardScaler</code> for feature scaling. I didn’t bother with a proper test set, assuming that classifier performance on the project video would be a good proxy for it.</p>
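<p>Roughly, the training code could look like the following minimal sketch, assuming <code class="language-plaintext highlighter-rouge">features</code> and <code class="language-plaintext highlighter-rouge">labels</code> have already been assembled from the dataset using the extractor described above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# `features` is an (n_samples, n_features) array of extracted feature vectors,
# `labels` holds 1 for vehicle samples and 0 for non-vehicle samples.
scaler = StandardScaler().fit(features)
X_train, X_valid, y_train, y_valid = train_test_split(
    scaler.transform(features), labels, test_size=0.2
)

classifier = LinearSVC()
classifier.fit(X_train, y_train)
print('Validation accuracy:', classifier.score(X_valid, y_valid))
</code></pre></div></div>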
<p class="notice">For implementation details check <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/detecting-road-features.ipynb" target="_blank">`detecting-road-features.ipynb</a> notebook.</p>
<h2 id="frame-segmentation">Frame segmentation</h2>
<p>I use a sliding window approach with a couple of additional constraints. For instance, we can approximate the vehicle size we expect in different frame regions, which makes searching a bit easier.</p>
<p align="center">
<a href="/images/posts/detecting-road-features/windows.jpg"><img src="/images/posts/detecting-road-features/windows.jpg" /></a>
</p>
<p style="text-align: center;" class="small">Window size varies across scanning locations</p>
<p>Since frame segments must come in various sizes, and we eventually need to use 64×64 regions as classifier input, I decided to simply scale the frame to various sizes and then scan each scaled copy with a 64×64 window. This can be roughly encoded as follows.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Scan with 64×64 window across 8 differently scaled images, ranging from 30% to 80% of the original frame size.
</span><span class="k">for</span> <span class="p">(</span><span class="n">scale</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(.</span><span class="mi">3</span><span class="p">,</span> <span class="p">.</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">logspace</span><span class="p">(.</span><span class="mi">6</span><span class="p">,</span> <span class="p">.</span><span class="mi">55</span><span class="p">,</span> <span class="mi">4</span><span class="p">)):</span>
<span class="c1"># Scale the original frame
</span> <span class="n">scaled</span> <span class="o">=</span> <span class="n">resize</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="p">(</span><span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">scale</span><span class="p">,</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">scale</span><span class="p">,</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]))</span>
<span class="c1"># Prepare a feature extractor
</span> <span class="n">extractor</span> <span class="o">=</span> <span class="n">FeatureExtractor</span><span class="p">(</span><span class="n">scaled</span><span class="p">)</span>
<span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="o">=</span> <span class="n">scaled</span><span class="p">.</span><span class="n">shape</span>
<span class="n">s</span> <span class="o">=</span> <span class="mi">64</span> <span class="o">//</span> <span class="mi">3</span>
<span class="c1"># Target stride is no more than s (1/3 of the window size here),
</span> <span class="c1"># making sure windows are equally distributed along the frame width.
</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">w</span> <span class="o">-</span> <span class="n">k</span><span class="p">,</span> <span class="p">(</span><span class="n">w</span> <span class="o">+</span> <span class="n">s</span><span class="p">)</span> <span class="o">//</span> <span class="n">s</span><span class="p">):</span>
<span class="c1"># Extract features for current window.
</span> <span class="n">features</span> <span class="o">=</span> <span class="n">extractor</span><span class="p">.</span><span class="n">feature_vector</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">h</span><span class="o">*</span><span class="n">y</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>
<span class="c1"># Run features through a scaler and classifier and add window coordinates
</span> <span class="c1"># to `detections` if classified as containing a vehicle
</span> <span class="p">...</span>
</code></pre></div></div>
<h2 id="merging-multiple-detections">Merging multiple detections</h2>
<p>As there are multiple detections on different scales and overlapping windows, we need to merge nearby detections. In order to do that we calculate a heatmap of intersecting regions that were classified as containing vehicles.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">heatmap</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
<span class="c1"># Add heat to each box in box list
</span><span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">detections</span><span class="p">:</span>
<span class="c1"># Assuming each set of coordinates takes the form (x1, y1, x2, y2)
</span> <span class="n">heatmap</span><span class="p">[</span><span class="n">c</span><span class="p">[</span><span class="mi">1</span><span class="p">]:</span><span class="n">c</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span><span class="n">c</span><span class="p">[</span><span class="mi">2</span><span class="p">]]</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="c1"># Apply threshold to help remove false positives
</span><span class="n">heatmap</span><span class="p">[</span><span class="n">heatmap</span> <span class="o"><</span> <span class="n">threshold</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div>
<p>Then we use the <code class="language-plaintext highlighter-rouge">label()</code> function from the <code class="language-plaintext highlighter-rouge">scipy.ndimage.measurements</code> module to detect individual groups of detections, and calculate a bounding rect for each of them.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">groups</span> <span class="o">=</span> <span class="n">label</span><span class="p">(</span><span class="n">heatmap</span><span class="p">)</span>
<span class="n">detections</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">empty</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="c1"># Iterate through all labeled groups
</span><span class="k">for</span> <span class="n">group</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">groups</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
<span class="c1"># Find pixels belonging to the same group
</span> <span class="n">nonzero</span> <span class="o">=</span> <span class="p">(</span><span class="n">groups</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">group</span><span class="p">).</span><span class="n">nonzero</span><span class="p">()</span>
<span class="n">detections</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span>
<span class="n">detections</span><span class="p">,</span>
<span class="p">[[</span><span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">nonzero</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">(</span><span class="n">nonzero</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">nonzero</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">nonzero</span><span class="p">[</span><span class="mi">0</span><span class="p">])]],</span>
<span class="n">axis</span><span class="o">=</span><span class="mi">0</span>
<span class="p">)</span>
</code></pre></div></div>
<p align="center">
<a href="/images/posts/detecting-road-features/detections.jpg"><img src="/images/posts/detecting-road-features/detections.jpg" /></a>
</p>
<p style="text-align: center;" class="small">Merging detections with a heat map</p>
<h2 id="sequence-of-frames-1">Sequence of frames</h2>
<p>Working with video allows us to use a couple of additional constraints, since we expect it to be a stream of consecutive frames. In order to eliminate false positives I, again, use the <code class="language-plaintext highlighter-rouge">deque</code> collection type to accumulate detections over the last <code class="language-plaintext highlighter-rouge">N</code> frames instead of classifying each frame individually. And before returning a final set of detected regions I run those accumulated detections through the heatmap merging process once again, but with a higher detection threshold.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">detections_history</span> <span class="o">=</span> <span class="n">deque</span><span class="p">(</span><span class="n">maxlen</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">frame</span><span class="p">):</span>
<span class="p">...</span>
<span class="c1"># Scan frame with windows through a classifier
</span> <span class="p">...</span>
<span class="c1"># Merge detections
</span> <span class="p">...</span>
<span class="c1"># Add merged detections to history
</span> <span class="n">detections_history</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">detections</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">heatmap_merge</span><span class="p">(</span><span class="n">detections</span><span class="p">,</span> <span class="n">threshold</span><span class="p">):</span>
<span class="c1"># Calculate heatmap for detections
</span> <span class="p">...</span>
<span class="c1"># Apply threshold
</span> <span class="p">...</span>
<span class="c1"># Merge detections with `label()
</span> <span class="p">...</span>
<span class="c1"># Calculate bounding rects
</span> <span class="p">...</span>
<span class="k">def</span> <span class="nf">detections</span><span class="p">():</span>
<span class="k">return</span> <span class="n">heatmap_merge</span><span class="p">(</span>
<span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">detections_history</span><span class="p">)),</span>
<span class="n">threshold</span><span class="o">=</span><span class="nb">min</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">detections_history</span><span class="p">),</span> <span class="mi">15</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>This approach proved itself to work reasonably well on the source video; you can check out the <a href="https://github.com/alexstaravoitau/advanced-lane-finding/blob/master/data/video/project_video_annotated_vehicle.mp4" target="_blank">full annotated video here</a>. The current frame’s heat map is shown in the top right corner — you may notice quite a few false positives, but most of them are eliminated by merging detections over the last <code class="language-plaintext highlighter-rouge">N</code> consecutive frames.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/detecting-road-features/project_video_sample-2.gif" alt="image-center" class="align-center" />
Sample of the annotated project video</p>
<p class="notice">For implementation details check <code class="language-plaintext highlighter-rouge">VehicleTracker</code> class in <a href="https://github.com/alexstaravoitau/detecting-road-features/blob/master/source/vehicletracker/tracker.py" target="_blank"><code class="language-plaintext highlighter-rouge">vehicletracker/tracker.py</code></a>.</p>
<h1 id="results">Results</h1>
<p>This is clearly a very naive way of detecting and tracking road features, and wouldn’t be usable in a real-world application as-is, since it is likely to fail in too many scenarios:</p>
<ul>
<li>Going up or down the hill.</li>
<li>Changing weather conditions.</li>
<li>Worn out lane markings.</li>
<li>Obstruction by other vehicles or vehicles obstructing each other.</li>
<li>Vehicles and vehicle positions different from those the classifier was trained on.</li>
<li>…</li>
</ul>
<p>Not to mention it is painfully slow and would not run in real time without substantial optimisations. Nevertheless, this project is a good illustration of what can be done by simply inspecting pixel value gradients and color spaces. It shows that even with these limited tools we can extract a lot of useful information from an image, and that this information could potentially be used as feature input to more sophisticated algorithms.</p>
<script async="" defer="" src="https://buttons.github.io/buttons.js"></script>Alex StaravoitauThe goal of this project was to try and detect a set of road features in a forward facing vehicle camera data. This is a somewhat naive way as it is mainly using computer vision techniques (no relation to naive Bayesian!). Features we are going to detect and track are lane boundaries and surrounding vehicles.Meet Fenton (my data crunching machine)2017-02-25T00:00:00+00:002017-02-25T00:00:00+00:00//navoshta.com/meet-fenton<aside class="sidebar__right">
<nav class="toc">
<header><h4 class="nav__title"><i class="fa fa-none"></i> Contents</h4></header>
<ul class="toc__menu" id="markdown-toc">
<li><a href="#hardware" id="markdown-toc-hardware">Hardware</a> <ul>
<li><a href="#video-card" id="markdown-toc-video-card">Video Card</a></li>
<li><a href="#motherboard" id="markdown-toc-motherboard">Motherboard</a></li>
<li><a href="#cpu" id="markdown-toc-cpu">CPU</a></li>
<li><a href="#ram" id="markdown-toc-ram">RAM</a></li>
<li><a href="#storage" id="markdown-toc-storage">Storage</a></li>
<li><a href="#power-supply" id="markdown-toc-power-supply">Power supply</a></li>
<li><a href="#case" id="markdown-toc-case">Case</a></li>
<li><a href="#putting-it-together" id="markdown-toc-putting-it-together">Putting it together</a></li>
</ul>
</li>
<li><a href="#software" id="markdown-toc-software">Software</a> <ul>
<li><a href="#operating-system" id="markdown-toc-operating-system">Operating System</a></li>
<li><a href="#ssh" id="markdown-toc-ssh">SSH</a></li>
<li><a href="#ssh-file-system" id="markdown-toc-ssh-file-system">SSH File system</a></li>
<li><a href="#jupyter-notebook" id="markdown-toc-jupyter-notebook">Jupyter Notebook</a></li>
<li><a href="#pycharm" id="markdown-toc-pycharm">PyCharm</a></li>
<li><a href="#monitoring" id="markdown-toc-monitoring">Monitoring</a></li>
</ul>
</li>
<li><a href="#pick-a-name" id="markdown-toc-pick-a-name">Pick a name</a></li>
</ul>
</nav>
</aside>
<p>As you might be aware, I have been experimenting with <a href="http://navoshta.com/aws-tensorflow/">AWS as a remote GPU-enabled machine</a> for a while, configuring Jupyter Notebook to use it as a backend. It seemed to work fine, although costs did build up over time, I always had to remember to shut the instance off, and there were a couple of other limitations. Long story short, around 3 months ago I decided to build my own machine learning rig.</p>
<p>My idea in a nutshell was to build a machine that would only act as a server, accessible to me from anywhere and always ready to unleash its computational powers on whichever task I’d be working on. Although this setup did take some time to assess, assemble and configure, it has been working flawlessly ever since, and I am very happy with it.</p>
<h1 id="hardware">Hardware</h1>
<p>Let’s start with hardware. This would include the server PC and some basic peripherals: I didn’t even bother to buy a monitor or a mouse, as I only intended to use this machine remotely from CLI. My main considerations were performance in machine learning tasks and extensibility in case I decided to upgrade at some point. This is the <a href="https://uk.pcpartpicker.com/list/tKjTzM">config I came up with</a>.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Type</th>
<th style="text-align: left">Item</th>
<th style="text-align: left">Price</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><strong>Video Card</strong></td>
<td style="text-align: left"><a href="https://uk.pcpartpicker.com/product/63yxFT/evga-video-card-08gp46183">EVGA GeForce GTX 1080 8GB Superclocked Gaming ACX 3.0 Video Card</a></td>
<td style="text-align: left">£629.84</td>
</tr>
<tr>
<td style="text-align: left"><strong>Motherboard</strong></td>
<td style="text-align: left"><a href="https://uk.pcpartpicker.com/product/LsX2FT/asus-motherboard-z170pro">Asus Z170-PRO ATX LGA1151 Motherboard</a></td>
<td style="text-align: left">£129.99</td>
</tr>
<tr>
<td style="text-align: left"><strong>CPU</strong></td>
<td style="text-align: left"><a href="https://uk.pcpartpicker.com/product/rK4NnQ/intel-cpu-bx80662i56400">Intel Core i5-6400 2.7GHz Quad-Core Processor</a></td>
<td style="text-align: left">£161.99</td>
</tr>
<tr>
<td style="text-align: left"><strong>Memory</strong></td>
<td style="text-align: left"><a href="https://uk.pcpartpicker.com/product/YPX2FT/corsair-vengeance-lpx-32gb-2-x-16gb-ddr4-3200-memory-cmk32gx4m2b3200c16w">Corsair Vengeance LPX 32GB (2 × 16GB) DDR4-3200 Memory</a></td>
<td style="text-align: left">£182.86</td>
</tr>
<tr>
<td style="text-align: left"><strong>Storage</strong></td>
<td style="text-align: left"><a href="https://uk.pcpartpicker.com/product/RbvZxr/samsung-internal-hard-drive-mz75e1t0bam">Samsung 850 EVO-Series 1TB 2.5” Solid State Drive</a></td>
<td style="text-align: left">£295.98</td>
</tr>
<tr>
<td style="text-align: left"><strong>Power Supply</strong></td>
<td style="text-align: left"><a href="https://uk.pcpartpicker.com/product/9q4NnQ/evga-power-supply-220g20650y1">EVGA SuperNOVA G2 650W 80+ Gold Certified Fully-Modular ATX Power Supply</a></td>
<td style="text-align: left">£89.99</td>
</tr>
<tr>
<td style="text-align: left"><strong>Case</strong></td>
<td style="text-align: left"><a href="https://uk.pcpartpicker.com/product/Vpdqqs/nzxt-case-cas340ww1">NZXT S340 (White) ATX Mid Tower Case</a></td>
<td style="text-align: left">£59.98</td>
</tr>
<tr>
<td style="text-align: left"><strong>Keyboard</strong></td>
<td style="text-align: left"><a href="https://uk.pcpartpicker.com/product/2PnG3C/microsoft-keyboard-anb00006">Microsoft ANB-00006 Wired Slim Keyboard</a></td>
<td style="text-align: left">£11.63</td>
</tr>
<tr>
<td style="text-align: left"><strong>Total</strong></td>
<td style="text-align: left"> </td>
<td style="text-align: left"><strong>£1562.26</strong></td>
</tr>
</tbody>
</table>
<figure class="third ">
<a href="//navoshta.com/images/posts/fenton/evga-geforce-gtx-1080.jpg" title="EVGA GeForce GTX 1080 8GB Superclocked">
<img src="//navoshta.com/images/posts/fenton/evga-geforce-gtx-1080.jpg" alt="EVGA GeForce GTX 1080 8GB Superclocked" />
</a>
<a href="//navoshta.com/images/posts/fenton/evga-geforce-gtx-1080_2.jpg" title="EVGA GeForce GTX 1080 8GB Superclocked">
<img src="//navoshta.com/images/posts/fenton/evga-geforce-gtx-1080_2.jpg" alt="EVGA GeForce GTX 1080 8GB Superclocked" />
</a>
<a href="//navoshta.com/images/posts/fenton/evga-geforce-gtx-1080_3.jpg" title="EVGA GeForce GTX 1080 8GB Superclocked">
<img src="//navoshta.com/images/posts/fenton/evga-geforce-gtx-1080_3.jpg" alt="EVGA GeForce GTX 1080 8GB Superclocked" />
</a>
<a href="//navoshta.com/images/posts/fenton/asus-z170.jpg" title="ASUS Z170 Pro">
<img src="//navoshta.com/images/posts/fenton/asus-z170.jpg" alt="ASUS Z170 Pro" />
</a>
<a href="//navoshta.com/images/posts/fenton/core-i5.jpg" title="Intel Core i5-6400">
<img src="//navoshta.com/images/posts/fenton/core-i5.jpg" alt="Intel Core i5-6400" />
</a>
<a href="//navoshta.com/images/posts/fenton/vengeance-ram.jpg" title="Corsair Vengeance LPX 32GB (2 × 16GB) DDR4-3200">
<img src="//navoshta.com/images/posts/fenton/vengeance-ram.jpg" alt="Corsair Vengeance LPX 32GB (2 × 16GB) DDR4-3200" />
</a>
<a href="//navoshta.com/images/posts/fenton/samsung-850-evo.jpg" title="Samsung 850 Evo 1 TB SSD">
<img src="//navoshta.com/images/posts/fenton/samsung-850-evo.jpg" alt="Samsung 850 Evo 1 TB SSD" />
</a>
<a href="//navoshta.com/images/posts/fenton/evga-650-g2.jpg" title="EVGA SuperNOVA G2 650W">
<img src="//navoshta.com/images/posts/fenton/evga-650-g2.jpg" alt="EVGA SuperNOVA G2 650W" />
</a>
<a href="//navoshta.com/images/posts/fenton/nzxt.jpg" title="NZXT S340 ATX Mid Tower Case">
<img src="//navoshta.com/images/posts/fenton/nzxt.jpg" alt="NZXT S340 ATX Mid Tower Case" />
</a>
</figure>
<p>Let’s break this list down and I will elaborate on some of the choices I made.</p>
<h2 id="video-card">Video Card</h2>
<p>This is the most crucial part. After serious consideration and some budget juggling I decided to invest in an <strong>EVGA GeForce GTX 1080 8GB</strong> card backed by the <strong>Nvidia GTX 1080</strong> GPU. It is really snappy (and expensive), and in <a href="http://navoshta.com/cpu-vs-gpu/">this particular case</a> training only takes 15 minutes to run — 3 times faster than on a <strong>g2.2xlarge</strong> AWS machine! If you still feel hesitant, think of it this way: the faster your model runs, the more experiments you can carry out over the same period of time.</p>
<h2 id="motherboard">Motherboard</h2>
<p><strong>ASUS Z170 Pro</strong> had some nice reviews and, most importantly, is capable of handling a maximum of two massive GPUs like the GTX 1080. Yes, the GTX 1080 is pretty large and is going to take 2 PCI slots on your motherboard — something to keep in mind if you plan to stack them in future. The Asus Z170 even supports SLI, although you wouldn’t need it if you are only using the GPUs for machine learning tasks. It supports a maximum of 64 GB of RAM, which should also be enough if I decide to upgrade.</p>
<h2 id="cpu">CPU</h2>
<p>This part was easy. I simply went with what was not too expensive, and didn’t pursue any outstanding computational power here — at the time this happened to be the <strong>Intel Core i5-6400</strong>. I was thinking of buying a neat and quiet Noctua cooler at first, but the stock one seems to do the job and is pretty quiet as well, so I never bothered to replace it.</p>
<h2 id="ram">RAM</h2>
<p>I went with <strong>32GB (2 × 16GB) DDR4-3200</strong>, although it actually works at a lower clock rate. The important part was to get 2 × 16 GB modules, so that they only occupy 2 out of 4 available motherboard slots. This way, whenever I realise I need more RAM, I can simply get 2 more memory modules and bump it up to 64 GB.</p>
<h2 id="storage">Storage</h2>
<p>I decided to go with a <strong>Samsung 1 TB SSD</strong> as a system drive, and that is where the OS would go. Currently, however, I use it for everything, and still have the option of adding an additional 4–6 TB HDD when I start working with fairly large datasets.</p>
<h2 id="power-supply">Power supply</h2>
<p>Since my machine was supposed to be a server, it would be plugged in all the time. The <strong>EVGA SuperNOVA G2 650W</strong> has an automatic eco mode for times when you don’t use all of the machine’s power, and is 80+ Gold Certified. Thinking about it now, it would make sense to go up to 850W for potential upgrades, but 650W is more than enough for now. I would also highly recommend fully-modular power supplies, as they are so much easier to install.</p>
<h2 id="case">Case</h2>
<p>The main consideration here was to have a case that would support a potential upgrade, i.e. one that could fit the motherboard I decided to go with. The <strong>NZXT S340 ATX Mid Tower Case</strong> also turned out to be a pretty good choice in terms of cable management and looks!</p>
<h2 id="putting-it-together">Putting it together</h2>
<figure class="third ">
<a href="//navoshta.com/images/posts/fenton/piled.jpg" title="Everything piled together">
<img src="//navoshta.com/images/posts/fenton/piled.jpg" alt="Everything piled together" />
</a>
<a href="//navoshta.com/images/posts/fenton/installed.jpg" title="Everything put together">
<img src="//navoshta.com/images/posts/fenton/installed.jpg" alt="Everything put together" />
</a>
<a href="//navoshta.com/images/posts/fenton/install-ubuntu-2.jpg" title="Installing Ubuntu">
<img src="//navoshta.com/images/posts/fenton/install-ubuntu-2.jpg" alt="Installing Ubuntu" />
</a>
</figure>
<p>It took me a couple of hours to put everything together, but in my defense I had never done anything like that before, so it would probably take you less time if you are familiar with the process. Overall it is a pretty straightforward job, and it seemed like it would take some effort to screw things up big time.</p>
<p>Now, what I like most about this setup is a room for extension. If at some point I decide that it is not enough for my needs, there are a bunch of things I can improve by simply plugging something in, rather than replacing:</p>
<ul>
<li>Install 32 GB more RAM, resulting in 64 GB altogether.</li>
<li>Install additional storage with a 4–6 TB HDD.</li>
<li>Install another GPU, resulting in 2 × GTX 1080 setup.</li>
</ul>
<h1 id="software">Software</h1>
<h2 id="operating-system">Operating System</h2>
<p>It was supposed to be a server and it had to support all the modern machine learning libraries and frameworks, so I decided to go with <strong>Ubuntu 16.04</strong> as an operating system. It has a nice CLI, and I am familiar with Unix systems as I have macOS installed on my personal computer. I then installed most of the required frameworks and libraries with <strong><a href="https://www.continuum.io">Anaconda</a></strong> (apart from CUDA dependencies and <strong><a href="https://www.tensorflow.org">TensorFlow</a></strong>), and it was time to make my server accessible.</p>
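<p>For reference, a typical Anaconda-based setup could look roughly like this; the package list below is illustrative rather than the exact set I installed.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Install common scientific Python packages via conda
conda install numpy pandas matplotlib scikit-learn scikit-image jupyter
# TensorFlow with GPU support is installed separately
pip install tensorflow-gpu
</code></pre></div></div>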
<h2 id="ssh">SSH</h2>
<p>The easiest way to get hold of your server from another machine is by configuring <strong><a href="https://en.wikipedia.org/wiki/Secure_Shell#Key_management">SSH access with a key</a></strong>. The process is fairly straightforward and is explained in great detail <a href="https://www.digitalocean.com/community/tutorials/how-to-use-ssh-to-connect-to-a-remote-server-in-ubuntu">here</a>, and if you are less familiar with communication commands in Linux, you may want to check out <a href="https://www.guru99.com/communication-in-linux.html">this course</a>. Basically, you want your server to allow SSH connections, authenticating users with a key pair. You generate this key pair on your primary machine (the one you connect from), keeping your <em>private</em> key private and transferring the corresponding <em>public</em> key to the server. You then tell the server that this is <em>your</em> public key, so whoever knocks with the corresponding private key must be you.</p>
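<p>In practice this boils down to a couple of commands on your local machine (a sketch, using the same example user and address as in the snippets that follow).</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Generate a key pair locally; the private key never leaves this machine
ssh-keygen -t rsa -b 4096
# Append the public key to the server's list of authorized keys
ssh-copy-id tom@10.23.45.67
</code></pre></div></div>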
<p>Now, all of this will work while you are on the same local network. If you want to make the server accessible to the outside world, you may need to request a static IP from your provider, or install some sort of a <a href="https://en.wikipedia.org/wiki/Dynamic_DNS">dynamic DNS</a> daemon on your server (there are a couple of free services that allow that). You may also want to check your router settings first, as some routers support dynamic DNS services out of the box. Once you get hold of your machine’s domain name or IP, you can open a random port for SSH access in your router settings (and one for the Jupyter Notebook to broadcast its frontend). This is basically all it takes to make your server accessible from anywhere in the world (and this is why it is essential to secure your server with a key).</p>
<p class="notice"><strong>Don’t forget to set SSH keys!</strong> Exposing your server to the outside world is dangerous, for internet is dark and full of terrors. You don’t want those wildlings to hack into your machine.</p>
<h2 id="ssh-file-system">SSH File system</h2>
<p>Although the command line may seem like a user-friendly interface to some, there is an alternative way of accessing your server’s file system called <strong>SSH File system</strong>. It allows you to mount a portion of the file system on your remote machine to a local folder. The coolest thing about it is that once it is mounted, you can use any software you like to work with these mounted folders, be it an IDE or your favourite GUI git client. Things will definitely seem slower, but overall it should work just as if you had all those remote files locally.</p>
<p>If your user on the server machine happens to be <code class="language-plaintext highlighter-rouge">tom</code> and server’s IP is <code class="language-plaintext highlighter-rouge">10.23.45.67</code>, this would mount your entire server home directory to <code class="language-plaintext highlighter-rouge">~/your/mount/folder/</code> on your local machine.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sshfs <span class="nt">-o</span> delay_connect,reconnect,ServerAliveInterval<span class="o">=</span>5,ServerAliveCountMax<span class="o">=</span>3,allow_other,defer_permissions,IdentityFile<span class="o">=</span>/local/path/to/private/key tom@10.23.45.67:/home/tom ~/your/mount/folder/
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">/local/path/to/private/key</code> is, well, your local path to the private key for SSH access. Keep an eye on all those settings, as they are supposed to make the remote partition more stable in terms of retaining the connection. Finally, this is how you unmount the server file system.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>umount tom@10.23.45.67:/ &> /dev/null
</code></pre></div></div>
<p class="notice"><strong>Disclaimer:</strong> Keep in mind that many operations may seem way slower in macOS <em>Finder</em> as opposed to ssh-ing into the machine and using CLI. For instance, if you want to unzip an archive with a lot of files (say, a dataset) which is physically stored on your server, you may be tempted to open enclosing folder in <em>Finder</em> and open with <em>Archive Utility</em>. However this would be painfully slow, and a much faster way to do that would be this (see code below).</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Way faster than double-click in Finder</span>
ssh tom@10.23.45.67
<span class="nb">sudo </span>apt-get <span class="nb">install </span>unzip
<span class="nb">cd</span> ~/path/to/archive/folder/
unzip archive.zip
</code></pre></div></div>
<h2 id="jupyter-notebook">Jupyter Notebook</h2>
<p>Jupyter already has this amazingly flexible concept of using web pages as a frontend, essentially allowing you to run its backend anywhere. Setup and configuration were covered <a href="http://navoshta.com/aws-tensorflow/">in this post</a>; however, you may want to take it one step further and make sure Jupyter keeps running even after you disconnect from your server. I use <strong><a href="https://www.iterm2.com">iTerm</a></strong> as a terminal in macOS, which supports <strong><a href="https://en.wikipedia.org/wiki/Tmux">tmux</a></strong> sessions out of the box, so connecting to a long-living SSH session is as simple as the following.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh <span class="nt">-t</span> tom@10.23.45.67 <span class="s1">'tmux -CC attach'</span>
</code></pre></div></div>
<p>This would present a window attached to a tmux session, where you can start Jupyter Notebook server.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter notebook
</code></pre></div></div>
<p>You can now close the window — the Jupyter process will stay there, whether you are connected to the remote machine over SSH or not. And, of course, you can always get back to it by attaching to the same <em>tmux</em> session.</p>
<p class="notice"><strong>Don’t forget to set password!</strong> A wise thing to do would be configuring a password for Jupyter’s web interface access. Make sure to check out <a href="http://navoshta.com/aws-tensorflow/">my AWS post</a> where I describe it in more detail.</p>
<h2 id="pycharm">PyCharm</h2>
<p><strong><a href="https://www.jetbrains.com/pycharm/">PyCharm</a></strong> is my favourite Python IDE, <strong>PyCharm Community Edition</strong> is free but doesn’t support remote interpreters unfortunately, however <strong>PyCharm Professional</strong> does (and is not too expensive). You need to go through a cumbersome configuration of your project (which is described in depth <a href="https://medium.com/@erikhallstrm/work-remotely-with-pycharm-tensorflow-and-ssh-c60564be862d#.7sr7uresx">here</a>), but as a result you can work with your source code locally, and run it with a remote interpreter, leaving automatic syncing and deployment to PyCharm.</p>
<h2 id="monitoring">Monitoring</h2>
<p>Finally, I suggest installing a monitoring daemon on your remote machine, so that you can periodically check useful stats like CPU load, memory consumption, disk and network activity, etc. Ideally you want to monitor your GPU sensors as well, however I didn’t find any daemon-like monitoring software allowing that on Ubuntu — maybe you will have better luck with it.</p>
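<p>For ad-hoc checks you can always SSH in and poll the GPU sensors with Nvidia’s own CLI, for example:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Refresh GPU utilisation, memory and temperature readings every 5 seconds
watch -n 5 nvidia-smi
</code></pre></div></div>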
<p>What I decided to go with was <strong><a href="https://bjango.com/ios/istat/">iStat</a></strong>, which works with a wide range of sensors (the Nvidia GPU sensor is not on the list, unfortunately) and has a nice companion iOS app. This is what the training process looks like, for instance: the CPU is busy with some heavy on-the-go data augmentation, so you can see iStat’s CPU load graph exposing spikes for each training epoch.</p>
<figure class="half ">
<a href="//navoshta.com/images/posts/fenton/istat-1.jpg" title="iStat for iOS">
<img src="//navoshta.com/images/posts/fenton/istat-1.jpg" alt="iStat for iOS" />
</a>
<a href="//navoshta.com/images/posts/fenton/istat-2.jpg" title="iStat for iOS">
<img src="//navoshta.com/images/posts/fenton/istat-2.jpg" alt="iStat for iOS" />
</a>
</figure>
<h1 id="pick-a-name">Pick a name</h1>
<p>Arguably the most important step is picking your machine’s name. I named mine after <a href="https://www.youtube.com/watch?v=3GRSbr0EYYU">this famous dog</a>, probably because when making my first steps in data science, whenever my algorithm failed to learn I felt just as desperate and helpless as Fenton’s owner. Fortunately, this happens less and less often these days!</p>
<p align="center">
<img src="/images/posts/fenton/telegram_bot.jpg" alt="Telegram Bot" style="width: 375px;" />
</p>
<p style="text-align: center;" class="small">Fenton is a good <strike>bot</strike> boy, sending me messenger notifications when it finishes training</p>
<p>I also wrote a tiny shell script to make connecting to the remote machine easier. It allows me to SSH into it, mount its file system, or attach to a <em>tmux</em> session.</p>
<script src="https://gist.github.com/alexstaravoitau/e7860838e769dfed835418b38d8e069c.js"></script>
<p>Update the user/server/path settings, put this file in <code class="language-plaintext highlighter-rouge">/usr/local/bin</code> and make it executable.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Make fenton.sh executable</span>
<span class="nb">chmod</span> +x fenton.sh
</code></pre></div></div>
<p>You may also want to remove the file extension to do less typing in the CLI. Here is a list of available commands.</p>
<table>
<thead>
<tr>
<th>Command</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="language-plaintext highlighter-rouge">fenton</code></td>
<td>Connects via SSH.</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">fenton -fs</code></td>
<td>Mounts the remote machine’s file system to <code class="language-plaintext highlighter-rouge">LOCAL_MOUNT_PATH</code>.</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">fenton -jn</code></td>
<td>Attaches to a persistent <em>tmux</em> session, where I typically have my Jupyter Notebook running.</td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">fenton jesus christ</code></td>
<td>Couldn’t resist adding this one. Opens the Fenton video on YouTube.</td>
</tr>
</tbody>
</table>
<p>You are all set! Having your own dedicated machine allows you to do incredible things, like kicking off a background training job that is expected to run for hours or days, periodically checking on it. You could even receive notifications and updates on how the training is going using <a href="http://navoshta.com/cloud-log/">my cloud logger</a>! The main thing however is that you don’t need to worry anymore that your personal computer is not powerful enough for machine learning tasks, since there is a ton of computational power always accessible to you from anywhere in the world.</p>Alex StaravoitauThis is how I built and configured my dedicated data science machine that acts as a remote backend for Jupyter Notebook and PyCharm. It is backed by a powerful Nvidia GPU and is accessible from anywhere, so that when it comes to machine learning tasks I am no longer constrained by my personal computer hardware performance.End-to-end learning for self-driving cars2017-02-05T00:00:00+00:002017-02-05T00:00:00+00:00//navoshta.com/end-to-end-deep-learning<aside class="sidebar__right">
<nav class="toc">
<header><h4 class="nav__title"><i class="fa fa-none"></i> Contents</h4></header>
<ul class="toc__menu" id="markdown-toc">
<li><a href="#dataset" id="markdown-toc-dataset">Dataset</a> <ul>
<li><a href="#data-collection" id="markdown-toc-data-collection">Data collection</a></li>
<li><a href="#balancing-dataset" id="markdown-toc-balancing-dataset">Balancing dataset</a></li>
<li><a href="#data-augmentation" id="markdown-toc-data-augmentation">Data augmentation</a></li>
</ul>
</li>
<li><a href="#model" id="markdown-toc-model">Model</a></li>
<li><a href="#results" id="markdown-toc-results">Results</a></li>
</ul>
</nav>
</aside>
<p>I’m assuming you already know a fair bit about neural networks and regularization, as I won’t go into too much detail about their background and how they work. I am using <strong>Keras</strong> with the TensorFlow backend as an ML framework, plus a couple of dependencies like <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">pandas</code> and <code class="language-plaintext highlighter-rouge">scikit-image</code>. You may want to check out the <a href="https://github.com/alexstaravoitau/behavioral-cloning" target="_blank">code of the final solution</a> I am describing in this tutorial; however, keep in mind that if you would like to follow along, you may well need a machine with a CUDA-capable GPU.</p>
<p class="notice">Training a model to drive a car in a simulator is one of the assignments in <a href="http://udacity.com/drive"><strong>Udacity Self-Driving Car Nanodegree</strong></a> program, however the concepts described here should be easy to follow even without that context.</p>
<h2 id="dataset">Dataset</h2>
<p>The provided driving simulator had two different tracks. One of them was used for collecting training data, and the other one — never seen by the model — served as a substitute for a test set.</p>
<h3 id="data-collection">Data collection</h3>
<p>The driving simulator would save frames from three front-facing “cameras” recording data from the car’s point of view, as well as various driving statistics like throttle, speed and steering angle. We are going to use the camera data as model input and expect the model to predict the steering angle in the <code class="language-plaintext highlighter-rouge">[-1, 1]</code> range.</p>
<p>I have collected a dataset containing approximately <strong>1 hour worth of driving data</strong> around one of the given tracks. It contains both driving in <em>“smooth”</em> mode (staying right in the middle of the road for the whole lap) and in <em>“recovery”</em> mode (letting the car drive off center and then interfering to steer it back to the middle).</p>
<h3 id="balancing-dataset">Balancing dataset</h3>
<p>Just as one would expect, the resulting dataset was extremely unbalanced, with a lot of examples where the steering angle is close to <code class="language-plaintext highlighter-rouge">0</code> (e.g. when the wheel is “at rest” and not steering while driving in a straight line). So I applied random sampling designed to keep the data as balanced across steering angles as possible. This process included splitting steering angles into <code class="language-plaintext highlighter-rouge">n</code> bins and using at most <code class="language-plaintext highlighter-rouge">200</code> frames for each bin:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">read_csv</span><span class="p">(</span><span class="s">'data/driving_log.csv'</span><span class="p">)</span>
<span class="n">balanced</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span> <span class="c1"># Balanced dataset
</span><span class="n">bins</span> <span class="o">=</span> <span class="mi">1000</span> <span class="c1"># N of bins
</span><span class="n">bin_n</span> <span class="o">=</span> <span class="mi">200</span> <span class="c1"># N of examples to include in each bin (at most)
</span>
<span class="n">start</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">end</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="n">bins</span><span class="p">):</span>
<span class="n">df_range</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="n">np</span><span class="p">.</span><span class="n">absolute</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">steering</span><span class="p">)</span> <span class="o">>=</span> <span class="n">start</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">absolute</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">steering</span><span class="p">)</span> <span class="o"><</span> <span class="n">end</span><span class="p">)]</span>
<span class="n">range_n</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">bin_n</span><span class="p">,</span> <span class="n">df_range</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">balanced</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">balanced</span><span class="p">,</span> <span class="n">df_range</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">range_n</span><span class="p">)])</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">end</span>
<span class="n">balanced</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'data/driving_log_balanced.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<p>The histogram of the resulting dataset looks fairly balanced across the most “popular” steering angles.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/end-to-end-deep-learning/training_dataset_hist.png" alt="image-center" class="align-center" />
Dataset histogram</p>
<p>Please mind that we are balancing the dataset across <em>absolute</em> values: by applying a horizontal flip during augmentation we end up using both positive and negative steering angles for each frame.</p>
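<p>If you would like to reproduce the histogram yourself, something along these lines should do — a sketch assuming <code class="language-plaintext highlighter-rouge">matplotlib</code> is available and the <code class="language-plaintext highlighter-rouge">balanced</code> DataFrame from the snippet above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.pyplot as plt

# Distribution of absolute steering angles after balancing
plt.hist(np.absolute(balanced.steering), bins=100)
plt.xlabel('Absolute steering angle')
plt.ylabel('Number of frames')
plt.show()
</code></pre></div></div>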
<h3 id="data-augmentation">Data augmentation</h3>
<p>After balancing ~1 hour worth of driving data we ended up with <strong>7,698 samples</strong>, which most likely wouldn’t be enough for the model to generalise well. However, as many have pointed out, there are a couple of augmentation tricks that should let you extend the dataset significantly:</p>
<ul>
<li><strong>Left and right cameras</strong>. Along with each sample we receive frames from 3 camera positions: left, center and right. Although we are only going to use the central camera while driving, we can still use the left and right camera data during training after applying a steering angle correction, increasing the number of examples by a factor of 3.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cameras</span> <span class="o">=</span> <span class="p">[</span><span class="s">'left'</span><span class="p">,</span> <span class="s">'center'</span><span class="p">,</span> <span class="s">'right'</span><span class="p">]</span>
<span class="n">steering_correction</span> <span class="o">=</span> <span class="p">[.</span><span class="mi">25</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="o">-</span><span class="p">.</span><span class="mi">25</span><span class="p">]</span>
<span class="n">camera</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">cameras</span><span class="p">))</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">mpimg</span><span class="p">.</span><span class="n">imread</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">cameras</span><span class="p">[</span><span class="n">camera</span><span class="p">]].</span><span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="n">angle</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">steering</span><span class="p">.</span><span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">steering_correction</span><span class="p">[</span><span class="n">camera</span><span class="p">]</span>
</code></pre></div></div>
<ul>
<li><strong>Horizontal flip</strong>. For every batch we flip half of the frames horizontally and change the sign of the steering angle, increasing the number of examples by yet another factor of 2.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flip_indices</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="nb">int</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">x</span><span class="p">[</span><span class="n">flip_indices</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">flip_indices</span><span class="p">,</span> <span class="p">:,</span> <span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">y</span><span class="p">[</span><span class="n">flip_indices</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="n">y</span><span class="p">[</span><span class="n">flip_indices</span><span class="p">]</span>
</code></pre></div></div>
<ul>
<li><strong>Vertical shift</strong>. We cut out the insignificant top and bottom portions of the image during preprocessing, and choosing the amount to crop at random should increase the model’s ability to generalise.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(.</span><span class="mi">325</span><span class="p">,</span> <span class="p">.</span><span class="mi">425</span><span class="p">)</span> <span class="o">*</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">bottom</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(.</span><span class="mi">075</span><span class="p">,</span> <span class="p">.</span><span class="mi">175</span><span class="p">)</span> <span class="o">*</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">image</span><span class="p">[</span><span class="n">top</span><span class="p">:</span><span class="o">-</span><span class="n">bottom</span><span class="p">,</span> <span class="p">:]</span>
</code></pre></div></div>
<ul>
<li><strong>Random shadow</strong>. We add a random vertical “shadow” by decreasing brightness of a frame slice, hoping to make the model invariant to actual shadows on the road.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">h</span><span class="p">,</span> <span class="n">w</span> <span class="o">=</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="p">[</span><span class="n">x1</span><span class="p">,</span> <span class="n">x2</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">k</span> <span class="o">=</span> <span class="n">h</span> <span class="o">/</span> <span class="p">(</span><span class="n">x2</span> <span class="o">-</span> <span class="n">x1</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="o">-</span> <span class="n">k</span> <span class="o">*</span> <span class="n">x1</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">h</span><span class="p">):</span>
<span class="n">c</span> <span class="o">=</span> <span class="nb">int</span><span class="p">((</span><span class="n">i</span> <span class="o">-</span> <span class="n">b</span><span class="p">)</span> <span class="o">/</span> <span class="n">k</span><span class="p">)</span>
<span class="n">image</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:</span><span class="n">c</span><span class="p">,</span> <span class="p">:]</span> <span class="o">=</span> <span class="p">(</span><span class="n">image</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:</span><span class="n">c</span><span class="p">,</span> <span class="p">:]</span> <span class="o">*</span> <span class="p">.</span><span class="mi">5</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span>
</code></pre></div></div>
<p>We then preprocess each frame by cropping the top and bottom of the image and resizing it to the shape our model expects (<code class="language-plaintext highlighter-rouge">32×128×3</code>, RGB pixel intensities of a 32×128 image). The resizing operation also takes care of scaling pixel values to <code class="language-plaintext highlighter-rouge">[0, 1]</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">image</span> <span class="o">=</span> <span class="n">skimage</span><span class="p">.</span><span class="n">transform</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
</code></pre></div></div>
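<p>Putting the crop and resize together, per-frame preprocessing boils down to something like the sketch below. The <code class="language-plaintext highlighter-rouge">preprocess_frame</code> helper and its default crop fractions (midpoints of the random ranges above) are assumptions for illustration; the actual implementation may differ.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import skimage.transform

def preprocess_frame(image, top=.375, bottom=.125):
    # Crop out the sky above the road and the hood of the car
    top_px = int(top * image.shape[0])
    bottom_px = int(bottom * image.shape[0])
    image = image[top_px:-bottom_px, :]
    # Resize to the model input shape; this also scales pixel values to [0, 1]
    return skimage.transform.resize(image, (32, 128, 3))
</code></pre></div></div>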
<p>To make better sense of it, let’s consider an example of a <strong>single recorded sample</strong> that we turn into <strong>16 training samples</strong> by using frames from all three cameras and applying the aforementioned augmentation pipeline.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/end-to-end-deep-learning/frames_original.png" alt="image-center" class="align-center" />
Original frames</p>
<p style="text-align: center;" class="small"><img src="/images/posts/end-to-end-deep-learning/frames_augmented.png" alt="image-center" class="align-center" />
Augmented and preprocessed frames</p>
<p>The augmentation pipeline is applied in <a href="https://github.com/alexstaravoitau/behavioral-cloning/blob/master/data.py" target="_blank"><code class="language-plaintext highlighter-rouge">data.py</code></a> using a Keras generator, which lets us run it in real time on the CPU while the GPU is busy backpropagating!</p>
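<p>For a rough idea of what such a generator involves, here is a minimal sketch. It assumes the <code class="language-plaintext highlighter-rouge">cameras</code>, <code class="language-plaintext highlighter-rouge">steering_correction</code> and <code class="language-plaintext highlighter-rouge">preprocess_frame</code> definitions from above; the real version in <code class="language-plaintext highlighter-rouge">data.py</code> differs in details, such as the exact augmentation steps applied per frame.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.image as mpimg

def batch_generator(data, batch_size=128):
    # Loop over the dataset forever in random order, yielding one batch at a time
    while True:
        indices = np.random.permutation(data.shape[0])
        for start in range(0, data.shape[0], batch_size):
            batch_indices = indices[start:start + batch_size]
            x, y = [], []
            for i in batch_indices:
                # Pick a random camera and correct the steering angle accordingly
                camera = np.random.randint(len(cameras))
                image = mpimg.imread(data[cameras[camera]].values[i])
                angle = data.steering.values[i] + steering_correction[camera]
                x.append(preprocess_frame(image))
                y.append(angle)
            yield np.array(x), np.array(y)
</code></pre></div></div>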
<h2 id="model">Model</h2>
<p>I started with the model described in the <a href="https://arxiv.org/abs/1604.07316" target="_blank">Nvidia paper</a> and kept simplifying and optimising it while making sure it performed well on both tracks. It was clear we wouldn’t need a model that complicated, as the data we are working with is way simpler and much more constrained than what the Nvidia team had to deal with when running their model. Eventually I settled on a fairly simple architecture with <strong>3 convolutional layers and 3 fully connected layers</strong>.</p>
<figure>
<a href="/images/posts/end-to-end-deep-learning/model.png"><img src="/images/posts/end-to-end-deep-learning/model.png" /></a>
</figure>
<p style="text-align: center;" class="small">Model architecture</p>
<p>This model can be very briefly encoded with Keras.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">models</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">core</span><span class="p">,</span> <span class="n">convolutional</span><span class="p">,</span> <span class="n">pooling</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="p">.</span><span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">convolutional</span><span class="p">.</span><span class="n">Convolution2D</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">128</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">pooling</span><span class="p">.</span><span class="n">MaxPooling2D</span><span class="p">(</span><span class="n">pool_size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">convolutional</span><span class="p">.</span><span class="n">Convolution2D</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">pooling</span><span class="p">.</span><span class="n">MaxPooling2D</span><span class="p">(</span><span class="n">pool_size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">convolutional</span><span class="p">.</span><span class="n">Convolution2D</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">pooling</span><span class="p">.</span><span class="n">MaxPooling2D</span><span class="p">(</span><span class="n">pool_size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">core</span><span class="p">.</span><span class="n">Flatten</span><span class="p">())</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">core</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">500</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">core</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">core</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">core</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>
<p>I added dropout on 2 out of 3 dense layers to prevent overfitting (not shown in the snippet above), and the model proved to generalise quite well. The model was trained using the <strong>Adam optimiser</strong> with a <strong>learning rate of <code class="language-plaintext highlighter-rouge">1e-04</code></strong> and <strong>mean squared error</strong> as the loss function. I used 20% of the training data for validation (which means that we only used <strong>6,158 out of 7,698 examples</strong> for training), and the model seems to perform quite well after training for <strong>~20 epochs</strong> — you can find the code related to training in <a href="https://github.com/alexstaravoitau/behavioral-cloning/blob/master/model.py" target="_blank"><code class="language-plaintext highlighter-rouge">model.py</code></a>.</p>
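<p>In Keras 1 terms, the training setup might boil down to something like this sketch. The <code class="language-plaintext highlighter-rouge">train_data</code>/<code class="language-plaintext highlighter-rouge">valid_data</code> split, the <code class="language-plaintext highlighter-rouge">batch_generator</code> from earlier and the sample counts (an 80%/20% split of 7,698) are assumptions made here for illustration; see <code class="language-plaintext highlighter-rouge">model.py</code> for the actual training code.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from keras.optimizers import Adam

# Mean squared error loss, Adam with a lowered learning rate
model.compile(optimizer=Adam(lr=1e-04), loss='mean_squared_error')
model.fit_generator(
    batch_generator(train_data),
    samples_per_epoch=6158,
    nb_epoch=20,
    validation_data=batch_generator(valid_data),
    nb_val_samples=1540
)
</code></pre></div></div>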
<h2 id="results">Results</h2>
<p>The car manages to drive just fine on both tracks pretty much endlessly. It rarely strays from the middle of the road; this is what driving looks like on track 2 (previously unseen).</p>
<p style="text-align: center;" class="small"><img src="/images/posts/end-to-end-deep-learning/track_2.gif" alt="image-center" class="align-center" />
Driving autonomously on a previously unseen track</p>
<p>You can check out a longer <a href="https://www.youtube.com/watch?v=J72Q9A0GeEo" target="_blank">highlights compilation video</a> of the car driving itself on both tracks.</p>
<p>Clearly this is a very basic example of end-to-end learning for self-driving cars; nevertheless, it should give a rough idea of what these models are capable of, even considering all the limitations of training and validating solely in a virtual driving simulator.</p>
<script async="" defer="" src="https://buttons.github.io/buttons.js"></script>Alex StaravoitauThe goal of this project was to train a end-to-end deep learning model that would let a car drive itself around the track in a driving simulator. The approach I took was based on a paper by Nvidia research team with a significantly simplified architecture that was optimised for this specific project.Traffic signs classification with a convolutional network2017-01-15T00:00:00+00:002017-01-15T00:00:00+00:00//navoshta.com/traffic-signs-classification<aside class="sidebar__right">
<nav class="toc">
<header><h4 class="nav__title"><i class="fa fa-none"></i> Contents</h4></header>
<ul class="toc__menu" id="markdown-toc">
<li><a href="#dataset" id="markdown-toc-dataset">Dataset</a></li>
<li><a href="#preprocessing" id="markdown-toc-preprocessing">Preprocessing</a></li>
<li><a href="#augmentation" id="markdown-toc-augmentation">Augmentation</a> <ul>
<li><a href="#flipping" id="markdown-toc-flipping">Flipping</a></li>
<li><a href="#rotation-and-projection" id="markdown-toc-rotation-and-projection">Rotation and projection</a></li>
</ul>
</li>
<li><a href="#model" id="markdown-toc-model">Model</a> <ul>
<li><a href="#architecture" id="markdown-toc-architecture">Architecture</a></li>
<li><a href="#regularization" id="markdown-toc-regularization">Regularization</a></li>
<li><a href="#implementation" id="markdown-toc-implementation">Implementation</a></li>
</ul>
</li>
<li><a href="#training" id="markdown-toc-training">Training</a></li>
<li><a href="#visualization" id="markdown-toc-visualization">Visualization</a></li>
<li><a href="#results" id="markdown-toc-results">Results</a></li>
</ul>
</nav>
</aside>
<p>I’m assuming you already know a fair bit about neural networks and regularization, as I won’t go into too much detail about their background and how they work. I am using <strong>TensorFlow</strong> as the ML framework, along with a couple of dependencies like <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">matplotlib</code> and <code class="language-plaintext highlighter-rouge">scikit-image</code>. In case you are not familiar with TensorFlow, make sure to check out <a href="http://navoshta.com/facial-with-tensorflow/" target="_blank">my recent post</a> about its core concepts.</p>
<p>If you would like to follow along, you will also need a machine with a CUDA-capable GPU and all dependencies installed. Here is a <a href="https://github.com/alexstaravoitau/traffic-signs/blob/master/Traffic_Signs_Recognition.ipynb" target="_blank">Jupyter notebook with the final solution</a> I am describing in this tutorial; if you run through all of its cells, you should get the same results.</p>
<h2 id="dataset">Dataset</h2>
<p>The <a href="http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset" target="_blank">German Traffic Sign Dataset</a> consists of <strong>39,209 32×32 px color images</strong> that we are supposed to use for training, and <strong>12,630 images</strong> that we will use for testing. Each image is a photo of a traffic sign belonging to one of 43 classes, i.e. traffic sign types.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/HiojuukJimAAAAAElFTkSuQmCC.png" alt="image-center" class="align-center" />
Random dataset sample</p>
<p>Each image is a 32×32×3 array of pixel intensities, represented as <code class="language-plaintext highlighter-rouge">[0, 255]</code> integer values in RGB color space. The class of each image is encoded as an integer in the 0 to 42 range. Let’s check if the training dataset is balanced across classes.</p>
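<p>A quick way to eyeball the class distribution — a sketch assuming the labels are loaded into an integer <code class="language-plaintext highlighter-rouge">y_train</code> array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import matplotlib.pyplot as plt

# Count how many training examples fall into each of the 43 classes
counts = np.bincount(y_train, minlength=43)
plt.bar(np.arange(43), counts)
plt.xlabel('Class')
plt.ylabel('Number of examples')
plt.show()
</code></pre></div></div>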
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/yGIoVOF9s+D6SauJlGkmSVCsv00iSpFqZjEiSpFqZjEiSpFqZjEiSpFqZjEiSpFqZjEiSpFqZjEiSpFqZjEiSpFqZjEiSpFqZjEiSpFqZjEiSpFqZjEiSpFr9f+oLc6HSvr24AAAAAElFTkSuQmCC.png" alt="image-center" class="align-center" />
Dataset classes distribution</p>
<p>Apparently the dataset is very unbalanced, and some classes are represented significantly better than others. Let’s now plot a bunch of random images for various classes to see what we are working with.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/wGGNjp6MlRqbwAAAABJRU5ErkJggg==.png" alt="image-center" class="align-center" />
Yield</p>
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/CxM7UvcMAPAAAAAElFTkSuQmCC.png" alt="image-center" class="align-center" />
No entry</p>
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/lPr0ICbgAAAABJRU5ErkJggg==.png" alt="image-center" class="align-center" />
General caution</p>
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/wetDaG2jcBk+gAAAABJRU5ErkJggg==.png" alt="image-center" class="align-center" />
Roundabout mandatory</p>
<p>The images differ significantly in terms of contrast and brightness, so we will need to apply some kind of histogram equalization; this should noticeably improve feature extraction.</p>
<h2 id="preprocessing">Preprocessing</h2>
<p>The usual preprocessing in this case would include scaling pixel values to <code class="language-plaintext highlighter-rouge">[0, 1]</code> (as they are currently in the <code class="language-plaintext highlighter-rouge">[0, 255]</code> range), representing labels in a one-hot encoding, and shuffling. Looking at the images, histogram equalization may be helpful as well. We will apply <em>localized</em> histogram equalization, as it seems to improve feature extraction even further in our case.</p>
<p>I will only use a single channel in my model, i.e. grayscale images instead of color ones. As Pierre Sermanet and Yann LeCun mentioned in <a href="http://yann.lecun.com/exdb/publis/pdf/sermanet-ijcnn-11.pdf" target="_blank">their paper</a>, using color channels didn’t seem to improve things a lot, so I will only take the <code class="language-plaintext highlighter-rouge">Y</code> channel of the <code class="language-plaintext highlighter-rouge">YCbCr</code> representation of an image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.utils</span> <span class="kn">import</span> <span class="n">shuffle</span>
<span class="kn">from</span> <span class="nn">skimage</span> <span class="kn">import</span> <span class="n">exposure</span>
<span class="k">def</span> <span class="nf">preprocess_dataset</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
<span class="c1">#Convert to grayscale, e.g. single Y channel
</span> <span class="n">X</span> <span class="o">=</span> <span class="mf">0.299</span> <span class="o">*</span> <span class="n">X</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mf">0.587</span> <span class="o">*</span> <span class="n">X</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mf">0.114</span> <span class="o">*</span> <span class="n">X</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">2</span><span class="p">]</span>
<span class="c1">#Scale features to be in [0, 1]
</span> <span class="n">X</span> <span class="o">=</span> <span class="p">(</span><span class="n">X</span> <span class="o">/</span> <span class="mf">255.</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="c1"># Apply localized histogram localization
</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
<span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">exposure</span><span class="p">.</span><span class="n">equalize_adapthist</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="k">if</span> <span class="n">y</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="c1"># Convert to one-hot encoding. Convert back with `y = y.nonzero()[1]`
</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="mi">43</span><span class="p">)[</span><span class="n">y</span><span class="p">]</span>
<span class="c1"># Shuffle the data
</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">shuffle</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="c1"># Add a single grayscale channel
</span> <span class="n">X</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="p">,))</span>
<span class="k">return</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span>
</code></pre></div></div>
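<p>Applied to the raw data (variable names assumed), this might look like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Preprocess both splits; labels come back one-hot encoded
X_train, y_train = preprocess_dataset(X_train, y_train)
X_test, y_test = preprocess_dataset(X_test, y_test)
</code></pre></div></div>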
<p>This is what original and preprocessed images look like:</p>
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/vDGPI83pXTxsYM+yVh7kid5kid5kid5kid5kn8z8jdH5T3JkzzJkzzJkzzJkzzJ71yejLUneZIneZIneZIneZLfY3ky1p7kSZ7kSZ7kSZ7kSX6P5clYe5IneZIneZIneZIn+T2WJ2PtSZ7kSZ7kSZ7kSZ7k91iejLUneZIneZIneZIneZLfY3ky1p7kSZ7kSZ7kSZ7kSX6P5f8DZc6ez8Sy66QAAAAASUVORK5CYII=.png" alt="image-center" class="align-center" />
Original</p>
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/fH5+9Nur3T2bA57T7e90qHNf0r6UWfH3rOyxmHv6bZXOns2m73D4XB8iwYd01kLBAKBQCAQCPw3OL7SPBAIBAKBQCDw3RHOWiAQCAQCgcAJRjhrgUAgEAgEAicY4awFAoFAIBAInGCEsxYIBAKBQCBwghHOWiAQCAQCgcAJRjhrgUAgEAgEAicYfwF7KOG348bCvwAAAABJRU5ErkJggg==.png" alt="image-center" class="align-center" />
Preprocessed</p>
<h2 id="augmentation">Augmentation</h2>
<p>The amount of data we have is not sufficient for a model to generalise well. It is also fairly unbalanced, and some classes are represented to a significantly lower extent than others. But we will fix this with data augmentation!</p>
<h3 id="flipping">Flipping</h3>
<p>First, we are going to apply a couple of tricks to extend our data by <em>flipping</em>. You might have noticed that some traffic signs are invariant to horizontal and/or vertical flipping, which basically means that we can flip an image and it should still be classified as belonging to the same class.</p>
<figure class="align-center" style="width: 500px">
<img src="/images/posts/traffic-signs-classification/aug_flip_h.png" alt="" />
</figure>
<figure class="align-center" style="width: 500px">
<img src="/images/posts/traffic-signs-classification/aug_flip_v.png" alt="" />
</figure>
<p>Some signs can be flipped either way — like <strong>Priority Road</strong> or <strong>No Entry</strong> signs.</p>
<figure class="align-center" style="width: 500px">
<img src="/images/posts/traffic-signs-classification/aug_flip_hv.png" alt="" />
</figure>
<p>Other signs are <em>180° rotation invariant</em>, and to rotate them 180° we will simply first flip them horizontally, and then vertically.</p>
<figure class="align-center" style="width: 500px">
<img src="/images/posts/traffic-signs-classification/aug_flip_h+v.png" alt="" />
</figure>
<p>Finally, there are signs that can be flipped but should then be classified as a sign of some other class. This is still useful, as we can use the data of these classes to extend their counterparts.</p>
<figure class="align-center" style="width: 500px">
<img src="/images/posts/traffic-signs-classification/aug_flip_hx.png" alt="" />
Turn left / Turn right
</figure>
<p>We are going to use this during augmentation. Let’s prepare a sign-flipping routine.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">def</span> <span class="nf">flip_extend</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="c1"># Classes of signs that, when flipped horizontally, should still be classified as the same class
</span> <span class="n">self_flippable_horizontally</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">17</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="mi">26</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">35</span><span class="p">])</span>
<span class="c1"># Classes of signs that, when flipped vertically, should still be classified as the same class
</span> <span class="n">self_flippable_vertically</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">17</span><span class="p">])</span>
<span class="c1"># Classes of signs that, when flipped horizontally and then vertically, should still be classified as the same class
</span> <span class="n">self_flippable_both</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">32</span><span class="p">,</span> <span class="mi">40</span><span class="p">])</span>
<span class="c1"># Classes of signs that, when flipped horizontally, would still be meaningful, but should be classified as some other class
</span> <span class="n">cross_flippable</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span>
<span class="p">[</span><span class="mi">19</span><span class="p">,</span> <span class="mi">20</span><span class="p">],</span>
<span class="p">[</span><span class="mi">33</span><span class="p">,</span> <span class="mi">34</span><span class="p">],</span>
<span class="p">[</span><span class="mi">36</span><span class="p">,</span> <span class="mi">37</span><span class="p">],</span>
<span class="p">[</span><span class="mi">38</span><span class="p">,</span> <span class="mi">39</span><span class="p">],</span>
<span class="p">[</span><span class="mi">20</span><span class="p">,</span> <span class="mi">19</span><span class="p">],</span>
<span class="p">[</span><span class="mi">34</span><span class="p">,</span> <span class="mi">33</span><span class="p">],</span>
<span class="p">[</span><span class="mi">37</span><span class="p">,</span> <span class="mi">36</span><span class="p">],</span>
<span class="p">[</span><span class="mi">39</span><span class="p">,</span> <span class="mi">38</span><span class="p">],</span>
<span class="p">])</span>
<span class="n">num_classes</span> <span class="o">=</span> <span class="mi">43</span>
<span class="n">X_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">empty</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">]],</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
<span class="n">y_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">empty</span><span class="p">([</span><span class="mi">0</span><span class="p">],</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">y</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_classes</span><span class="p">):</span>
<span class="c1"># First copy existing data for this class
</span> <span class="n">X_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">X_extended</span><span class="p">,</span> <span class="n">X</span><span class="p">[</span><span class="n">y</span> <span class="o">==</span> <span class="n">c</span><span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># If we can flip images of this class horizontally and they would still belong to said class...
</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">self_flippable_horizontally</span><span class="p">:</span>
<span class="c1"># ...Copy their flipped versions into extended array.
</span> <span class="n">X_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">X_extended</span><span class="p">,</span> <span class="n">X</span><span class="p">[</span><span class="n">y</span> <span class="o">==</span> <span class="n">c</span><span class="p">][:,</span> <span class="p">:,</span> <span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># If we can flip images of this class horizontally and they would belong to other class...
</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">cross_flippable</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]:</span>
<span class="c1"># ...Copy flipped images of that other class to the extended array.
</span> <span class="n">flip_class</span> <span class="o">=</span> <span class="n">cross_flippable</span><span class="p">[</span><span class="n">cross_flippable</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">c</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>
<span class="n">X_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">X_extended</span><span class="p">,</span> <span class="n">X</span><span class="p">[</span><span class="n">y</span> <span class="o">==</span> <span class="n">flip_class</span><span class="p">][:,</span> <span class="p">:,</span> <span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># Fill labels for added images set to current class.
</span> <span class="n">y_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">y_extended</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">full</span><span class="p">((</span><span class="n">X_extended</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">y_extended</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">c</span><span class="p">,</span> <span class="n">dtype</span> <span class="o">=</span> <span class="nb">int</span><span class="p">))</span>
<span class="c1"># If we can flip images of this class vertically and they would still belong to said class...
</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">self_flippable_vertically</span><span class="p">:</span>
<span class="c1"># ...Copy their flipped versions into extended array.
</span> <span class="n">X_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">X_extended</span><span class="p">,</span> <span class="n">X_extended</span><span class="p">[</span><span class="n">y_extended</span> <span class="o">==</span> <span class="n">c</span><span class="p">][:,</span> <span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:,</span> <span class="p">:],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># Fill labels for added images set to current class.
</span> <span class="n">y_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">y_extended</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">full</span><span class="p">((</span><span class="n">X_extended</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">y_extended</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">c</span><span class="p">,</span> <span class="n">dtype</span> <span class="o">=</span> <span class="nb">int</span><span class="p">))</span>
<span class="c1"># If we can flip images of this class horizontally AND vertically and they would still belong to said class...
</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">self_flippable_both</span><span class="p">:</span>
<span class="c1"># ...Copy their flipped versions into extended array.
</span> <span class="n">X_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">X_extended</span><span class="p">,</span> <span class="n">X_extended</span><span class="p">[</span><span class="n">y_extended</span> <span class="o">==</span> <span class="n">c</span><span class="p">][:,</span> <span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># Fill labels for added images set to current class.
</span> <span class="n">y_extended</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">y_extended</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">full</span><span class="p">((</span><span class="n">X_extended</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">y_extended</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">c</span><span class="p">,</span> <span class="n">dtype</span> <span class="o">=</span> <span class="nb">int</span><span class="p">))</span>
<span class="k">return</span> <span class="p">(</span><span class="n">X_extended</span><span class="p">,</span> <span class="n">y_extended</span><span class="p">)</span>
</code></pre></div></div>
<p>This simple trick lets us extend the original <strong>39,209</strong> training examples to <strong>63,538</strong>, nice! And it cost us nothing in terms of data collection or computational resources.</p>
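<p>Applying it is a one-liner (variable names assumed):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Extend the training set with flipped versions of eligible classes
X_train, y_train = flip_extend(X_train, y_train)
</code></pre></div></div>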
<h3 id="rotation-and-projection">Rotation and projection</h3>
<p>However, this is still not enough, and we need to augment even further. After experimenting with adding random <em>rotation</em>, <em>projection</em>, <em>blur</em>, <em>noise</em> and <em>gamma adjustment</em>, I settled on <em>rotation</em> and <em>projection</em> transformations in the pipeline. The projection transform also seems to take care of random shearing and scaling, as we randomly position image corners in a <code class="language-plaintext highlighter-rouge">[±delta, ±delta]</code> range.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">skimage.transform</span> <span class="kn">import</span> <span class="n">rotate</span>
<span class="kn">from</span> <span class="nn">skimage.transform</span> <span class="kn">import</span> <span class="n">warp</span>
<span class="kn">from</span> <span class="nn">skimage.transform</span> <span class="kn">import</span> <span class="n">ProjectiveTransform</span>
<span class="k">def</span> <span class="nf">rotate</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">intensity</span><span class="p">):</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])):</span>
<span class="n">delta</span> <span class="o">=</span> <span class="mf">30.</span> <span class="o">*</span> <span class="n">intensity</span> <span class="c1"># scale using augmentation intensity
</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">rotate</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">delta</span><span class="p">,</span> <span class="n">delta</span><span class="p">),</span> <span class="n">mode</span> <span class="o">=</span> <span class="s">'edge'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">X</span>
<span class="k">def</span> <span class="nf">apply_projection_transform</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">intensity</span><span class="p">):</span>
<span class="n">image_size</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">image_size</span> <span class="o">*</span> <span class="mf">0.3</span> <span class="o">*</span> <span class="n">intensity</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])):</span>
<span class="n">tl_top</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="c1"># Top left corner, top margin
</span> <span class="n">tl_left</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="c1"># Top left corner, left margin
</span> <span class="n">bl_bottom</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="c1"># Bottom left corner, bottom margin
</span> <span class="n">bl_left</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="c1"># Bottom left corner, left margin
</span> <span class="n">tr_top</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="c1"># Top right corner, top margin
</span> <span class="n">tr_right</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="c1"># Top right corner, right margin
</span> <span class="n">br_bottom</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="c1"># Bottom right corner, bottom margin
</span> <span class="n">br_right</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="c1"># Bottom right corner, right margin
</span>
<span class="n">transform</span> <span class="o">=</span> <span class="n">ProjectiveTransform</span><span class="p">()</span>
<span class="n">transform</span><span class="p">.</span><span class="n">estimate</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">((</span>
<span class="p">(</span><span class="n">tl_left</span><span class="p">,</span> <span class="n">tl_top</span><span class="p">),</span>
<span class="p">(</span><span class="n">bl_left</span><span class="p">,</span> <span class="n">image_size</span> <span class="o">-</span> <span class="n">bl_bottom</span><span class="p">),</span>
<span class="p">(</span><span class="n">image_size</span> <span class="o">-</span> <span class="n">br_right</span><span class="p">,</span> <span class="n">image_size</span> <span class="o">-</span> <span class="n">br_bottom</span><span class="p">),</span>
<span class="p">(</span><span class="n">image_size</span> <span class="o">-</span> <span class="n">tr_right</span><span class="p">,</span> <span class="n">tr_top</span><span class="p">)</span>
<span class="p">)),</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">((</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
<span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">image_size</span><span class="p">),</span>
<span class="p">(</span><span class="n">image_size</span><span class="p">,</span> <span class="n">image_size</span><span class="p">),</span>
<span class="p">(</span><span class="n">image_size</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">)))</span>
<span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">warp</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">transform</span><span class="p">,</span> <span class="n">output_shape</span><span class="o">=</span><span class="p">(</span><span class="n">image_size</span><span class="p">,</span> <span class="n">image_size</span><span class="p">),</span> <span class="n">order</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">mode</span> <span class="o">=</span> <span class="s">'edge'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">X</span>
</code></pre></div></div>
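<p>To illustrate how these helpers could be chained, here is a minimal sketch of a single augmentation call. Note that <code class="language-plaintext highlighter-rouge">apply_rotation_transform</code> is an assumed name for the rotation helper defined above, and <code class="language-plaintext highlighter-rouge">X_train</code> stands for the preprocessed training images.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def augment(X, intensity = 0.75):
    # Hypothetical wrapper chaining the augmentation steps defined above.
    X = apply_rotation_transform(X, intensity)
    X = apply_projection_transform(X, intensity)
    return X

# Augment a copy, so that the original examples are preserved.
X_augmented = augment(np.copy(X_train), intensity = 0.75)
</code></pre></div></div>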
<p>Please note that we use <code class="language-plaintext highlighter-rouge">edge</code> mode when applying these transformations, to ensure that we don’t end up with a black box around the warped image. Let’s check out what the images look like when we apply random augmentation with intensity = <code class="language-plaintext highlighter-rouge">0.75</code>.</p>
<table border="">
<tr>
<td align="center"><b>Original</b></td>
<td align="center"><b>Augmented (intensity = 0.75)</b></td>
</tr>
<tr>
<td><img src="/images/posts/traffic-signs-classification/aug_example_orig_1.png" alt="Original" /></td>
<td><img src="/images/posts/traffic-signs-classification/aug_example_aug_1.png" alt="Augmented" /></td>
</tr>
<tr>
<td><img src="/images/posts/traffic-signs-classification/aug_example_orig_2.png" alt="Original" /></td>
<td><img src="/images/posts/traffic-signs-classification/aug_example_aug_2.png" alt="Augmented" /></td>
</tr>
<tr>
<td><img src="/images/posts/traffic-signs-classification/aug_example_orig_3.png" alt="Original" /></td>
<td><img src="/images/posts/traffic-signs-classification/aug_example_aug_3.png" alt="Augmented" /></td>
</tr>
<tr>
<td><img src="/images/posts/traffic-signs-classification/aug_example_orig_4.png" alt="Original" /></td>
<td><img src="/images/posts/traffic-signs-classification/aug_example_aug_4.png" alt="Augmented" /></td>
</tr>
<tr>
<td><img src="/images/posts/traffic-signs-classification/aug_example_orig_5.png" alt="Original" /></td>
<td><img src="/images/posts/traffic-signs-classification/aug_example_aug_5.png" alt="Augmented" /></td>
</tr>
</table>
<h2 id="model">Model</h2>
<h3 id="architecture">Architecture</h3>
<p>I decided to use a deep neural network classifier as a model, which was inspired by <a href="http://navoshta.com/facial-with-tensorflow/" target="_blank">Daniel Nouri’s tutorial</a> and aforementioned <a href="http://yann.lecun.com/exdb/publis/pdf/sermanet-ijcnn-11.pdf" target="_blank">Pierre Sermanet / Yann LeCun paper</a>. It is fairly simple and has 4 layers: <strong>3 convolutional layers</strong> for feature extraction and <strong>one fully connected layer</strong> as a classifier.</p>
<p align="center">
<a href="/images/posts/traffic-signs-classification/traffic-signs-architecture.png"><img src="/images/posts/traffic-signs-classification/traffic-signs-architecture.png" /></a>
</p>
<p style="text-align: center;" class="small">Model architecture</p>
<p>As opposed to usual strict feed-forward CNNs I use <strong>multi-scale features</strong>, which means that each convolutional layer’s output is not only forwarded into the subsequent layer, but is also branched off and fed into the classifier (i.e. the fully connected layer). Please mind that these branched-off outputs undergo additional max-pooling, so that all convolutions are proportionally subsampled before going into the classifier.</p>
<h3 id="regularization">Regularization</h3>
<p>I use the following regularization techniques to minimize overfitting to training data:</p>
<ul>
<li><strong>Dropout</strong>. Dropout is amazing and will drastically improve generalization of your model. Normally you may only want to apply dropout to fully connected layers, as shared weights in convolutional layers are good regularizers themselves. However, I did notice a slight improvement in performance when using a bit of dropout on convolutional layers as well, so I left it in, but kept it to a minimum:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Type Size keep_p Dropout
Layer 1 5x5 Conv 32 0.9 10% of neurons
Layer 2 5x5 Conv 64 0.8 20% of neurons
Layer 3 5x5 Conv 128 0.7 30% of neurons
Layer 4 FC 1024 0.5 50% of neurons
</code></pre></div></div>
<ul>
<li>
<p><strong>L2 Regularization</strong>. I ended up using <strong>lambda = 0.0001</strong>, which seemed to perform best. An important point here is that the L2 loss should only include the weights of the fully connected layers; it normally doesn’t include the bias terms, the intuition being that bias terms don’t contribute to overfitting, as they don’t add any new degrees of freedom to the model (see the sketch after this list).</p>
</li>
<li>
<p><strong>Early stopping</strong>. I use early stopping with a patience of <strong>100 epochs</strong> to capture the last best-performing weights and roll back when the model starts overfitting the training data. I use the validation set’s cross entropy loss as the early stopping metric; the intuition behind using it instead of accuracy is that if your model is <em>confident</em> about its predictions, it should generalize better.</p>
</li>
</ul>
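<p>As a sketch of how that L2 term could be wired into the total loss, assuming the fully connected weights live under the <code class="language-plaintext highlighter-rouge">fc4</code> and <code class="language-plaintext highlighter-rouge">out</code> variable scopes defined below, and <code class="language-plaintext highlighter-rouge">cross_entropy_loss</code> is the base loss tensor:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tensorflow as tf

l2_lambda = 0.0001

# Collect only the fully connected weights; biases are excluded on purpose.
fc_weights = [v for v in tf.trainable_variables()
              if 'weights' in v.name and ('fc4' in v.name or 'out' in v.name)]
l2_loss = l2_lambda * tf.add_n([tf.nn.l2_loss(w) for w in fc_weights])
loss = cross_entropy_loss + l2_loss
</code></pre></div></div>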
<h3 id="implementation">Implementation</h3>
<p>I find it helpful to define a structure holding the hyperparameters I will be experimenting with and fine-tuning. It makes the tuning process easier, and in some cases even lets you automate it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">namedtuple</span>
<span class="n">Parameters</span> <span class="o">=</span> <span class="n">namedtuple</span><span class="p">(</span><span class="s">'Parameters'</span><span class="p">,</span> <span class="p">[</span>
<span class="c1"># Data parameters
</span> <span class="s">'num_classes'</span><span class="p">,</span> <span class="s">'image_size'</span><span class="p">,</span>
<span class="c1"># Training parameters
</span> <span class="s">'batch_size'</span><span class="p">,</span> <span class="s">'max_epochs'</span><span class="p">,</span> <span class="s">'log_epoch'</span><span class="p">,</span> <span class="s">'print_epoch'</span><span class="p">,</span>
<span class="c1"># Optimisations
</span> <span class="s">'learning_rate_decay'</span><span class="p">,</span> <span class="s">'learning_rate'</span><span class="p">,</span>
<span class="s">'l2_reg_enabled'</span><span class="p">,</span> <span class="s">'l2_lambda'</span><span class="p">,</span>
<span class="s">'early_stopping_enabled'</span><span class="p">,</span> <span class="s">'early_stopping_patience'</span><span class="p">,</span>
<span class="s">'resume_training'</span><span class="p">,</span>
<span class="c1"># Layers architecture
</span> <span class="s">'conv1_k'</span><span class="p">,</span> <span class="s">'conv1_d'</span><span class="p">,</span> <span class="s">'conv1_p'</span><span class="p">,</span>
<span class="s">'conv2_k'</span><span class="p">,</span> <span class="s">'conv2_d'</span><span class="p">,</span> <span class="s">'conv2_p'</span><span class="p">,</span>
<span class="s">'conv3_k'</span><span class="p">,</span> <span class="s">'conv3_d'</span><span class="p">,</span> <span class="s">'conv3_p'</span><span class="p">,</span>
<span class="s">'fc4_size'</span><span class="p">,</span> <span class="s">'fc4_p'</span>
<span class="p">])</span>
</code></pre></div></div>
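<p>A concrete configuration can then be declared in one place. For example (the batch size and epoch counts here are illustrative; the remaining values follow the figures quoted in this post):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parameters = Parameters(
    # Data parameters
    num_classes = 43, image_size = 32,
    # Training parameters
    batch_size = 256, max_epochs = 1000, log_epoch = 1, print_epoch = 5,
    # Optimisations
    learning_rate_decay = False, learning_rate = 0.001,
    l2_reg_enabled = True, l2_lambda = 0.0001,
    early_stopping_enabled = True, early_stopping_patience = 100,
    resume_training = False,
    # Layers architecture (as in the dropout table above)
    conv1_k = 5, conv1_d = 32, conv1_p = 0.9,
    conv2_k = 5, conv2_d = 64, conv2_p = 0.8,
    conv3_k = 5, conv3_d = 128, conv3_p = 0.7,
    fc4_size = 1024, fc4_p = 0.5
)
</code></pre></div></div>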
<p>Let’s first declare a couple of helpful TensorFlow routines that implement individual types of layers.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="k">def</span> <span class="nf">fully_connected</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="n">weights</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">get_variable</span><span class="p">(</span> <span class="s">'weights'</span><span class="p">,</span>
<span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="nb">input</span><span class="p">.</span><span class="n">get_shape</span><span class="p">()[</span><span class="mi">1</span><span class="p">],</span> <span class="n">size</span><span class="p">],</span>
<span class="n">initializer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">contrib</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">xavier_initializer</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">biases</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">get_variable</span><span class="p">(</span> <span class="s">'biases'</span><span class="p">,</span>
<span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="n">size</span><span class="p">],</span>
<span class="n">initializer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant_initializer</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">weights</span><span class="p">)</span> <span class="o">+</span> <span class="n">biases</span>
<span class="k">def</span> <span class="nf">fully_connected_relu</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">fully_connected</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">conv_relu</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">depth</span><span class="p">):</span>
<span class="n">weights</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">get_variable</span><span class="p">(</span> <span class="s">'weights'</span><span class="p">,</span>
<span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="n">kernel_size</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="nb">input</span><span class="p">.</span><span class="n">get_shape</span><span class="p">()[</span><span class="mi">3</span><span class="p">],</span> <span class="n">depth</span><span class="p">],</span>
<span class="n">initializer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">contrib</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">xavier_initializer</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">biases</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">get_variable</span><span class="p">(</span> <span class="s">'biases'</span><span class="p">,</span>
<span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="n">depth</span><span class="p">],</span>
<span class="n">initializer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant_initializer</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">conv</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">conv2d</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">weights</span><span class="p">,</span>
<span class="n">strides</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span> <span class="o">=</span> <span class="s">'SAME'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">conv</span> <span class="o">+</span> <span class="n">biases</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">pool</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span>
<span class="nb">input</span><span class="p">,</span>
<span class="n">ksize</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="n">strides</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="n">padding</span> <span class="o">=</span> <span class="s">'SAME'</span>
<span class="p">)</span>
</code></pre></div></div>
<p>I am using the Xavier initializer, which automatically determines the scale of initialization based on the layers’ dimensions, hence there are fewer parameters we need to experiment with.</p>
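<p>For intuition, here is a rough sketch of the uniform Xavier variant for a fully connected layer; this is not how <code class="language-plaintext highlighter-rouge">tf.contrib.layers.xavier_initializer()</code> is implemented internally, just the underlying idea:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import tensorflow as tf

def xavier_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform: limit = sqrt(6 / (fan_in + fan_out)), chosen to keep
    # the variance of activations roughly constant from layer to layer.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return tf.random_uniform([fan_in, fan_out], minval = -limit, maxval = limit)
</code></pre></div></div>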
<p>We can now encode the model, making the most of variable scopes, which keeps the code easier to read and maintain. This method will perform a full model pass.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model_pass</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">is_training</span><span class="p">):</span>
<span class="s">"""
Performs a full model pass.
Parameters
----------
input : Tensor
Batch of examples.
params : Parameters
Structure (`namedtuple`) containing model parameters.
is_training : Tensor of type tf.bool
Flag indicating if we are training or not (e.g. whether to use dropout).
Returns
-------
Tensor with predicted logits.
"""</span>
<span class="c1"># Convolutions
</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'conv1'</span><span class="p">):</span>
<span class="n">conv1</span> <span class="o">=</span> <span class="n">conv_relu</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">kernel_size</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv1_k</span><span class="p">,</span> <span class="n">depth</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv1_d</span><span class="p">)</span>
<span class="n">pool1</span> <span class="o">=</span> <span class="n">pool</span><span class="p">(</span><span class="n">conv1</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">pool1</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">is_training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">pool1</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv1_p</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">pool1</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'conv2'</span><span class="p">):</span>
<span class="n">conv2</span> <span class="o">=</span> <span class="n">conv_relu</span><span class="p">(</span><span class="n">pool1</span><span class="p">,</span> <span class="n">kernel_size</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv2_k</span><span class="p">,</span> <span class="n">depth</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv2_d</span><span class="p">)</span>
<span class="n">pool2</span> <span class="o">=</span> <span class="n">pool</span><span class="p">(</span><span class="n">conv2</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">pool2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">is_training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">pool2</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv2_p</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">pool2</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'conv3'</span><span class="p">):</span>
<span class="n">conv3</span> <span class="o">=</span> <span class="n">conv_relu</span><span class="p">(</span><span class="n">pool2</span><span class="p">,</span> <span class="n">kernel_size</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv3_k</span><span class="p">,</span> <span class="n">depth</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv3_d</span><span class="p">)</span>
<span class="n">pool3</span> <span class="o">=</span> <span class="n">pool</span><span class="p">(</span><span class="n">conv3</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">pool3</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">is_training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">pool3</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">conv3_p</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">pool3</span><span class="p">)</span>
<span class="c1"># Fully connected
</span>
<span class="c1"># 1st stage output
</span> <span class="n">pool1</span> <span class="o">=</span> <span class="n">pool</span><span class="p">(</span><span class="n">pool1</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">shape</span> <span class="o">=</span> <span class="n">pool1</span><span class="p">.</span><span class="n">get_shape</span><span class="p">().</span><span class="n">as_list</span><span class="p">()</span>
<span class="n">pool1</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">pool1</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">]])</span>
<span class="c1"># 2nd stage output
</span> <span class="n">pool2</span> <span class="o">=</span> <span class="n">pool</span><span class="p">(</span><span class="n">pool2</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">shape</span> <span class="o">=</span> <span class="n">pool2</span><span class="p">.</span><span class="n">get_shape</span><span class="p">().</span><span class="n">as_list</span><span class="p">()</span>
<span class="n">pool2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">pool2</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">]])</span>
<span class="c1"># 3rd stage output
</span> <span class="n">shape</span> <span class="o">=</span> <span class="n">pool3</span><span class="p">.</span><span class="n">get_shape</span><span class="p">().</span><span class="n">as_list</span><span class="p">()</span>
<span class="n">pool3</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">pool3</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">]])</span>
<span class="n">flattened</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="p">[</span><span class="n">pool1</span><span class="p">,</span> <span class="n">pool2</span><span class="p">,</span> <span class="n">pool3</span><span class="p">])</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'fc4'</span><span class="p">):</span>
<span class="n">fc4</span> <span class="o">=</span> <span class="n">fully_connected_relu</span><span class="p">(</span><span class="n">flattened</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">fc4_size</span><span class="p">)</span>
<span class="n">fc4</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">is_training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">fc4</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">fc4_p</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">fc4</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'out'</span><span class="p">):</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">fully_connected</span><span class="p">(</span><span class="n">fc4</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">params</span><span class="p">.</span><span class="n">num_classes</span><span class="p">)</span>
<span class="k">return</span> <span class="n">logits</span>
</code></pre></div></div>
<p>Note that we collect all branched-off convolutional layers’ outputs, flatten and concatenate them before passing them over to the classifier.</p>
<p class="notice">If you have questions about TensorFlow implementation, make sure to check out <a href="http://navoshta.com/facial-with-tensorflow/" target="_blank">my TensorFlow post</a> about variable scopes, saving and restoring sessions, implementing dropout and other interesting things!</p>
<h2 id="training">Training</h2>
<p>I have generated two datasets for training my model using the augmentation pipeline I mentioned earlier:</p>
<ul>
<li><strong>Extended</strong> dataset. This dataset simply contains <strong>20x more data</strong> than the original one — i.e. for each training example we generate 19 additional examples by jittering the original image, with <strong>augmentation intensity = 0.75</strong>.</li>
<li><strong>Balanced</strong> dataset. This dataset is balanced across classes and has <strong>20,000 examples</strong> for each class. These 20,000 contain the original training dataset, as well as jittered images from the original training set (with <strong>augmentation intensity = 0.75</strong>), topping the number of examples for each class up to 20,000 images (a sketch of this balancing step follows below).</li>
</ul>
<p class="notice"><strong>Disclaimer:</strong> Training on <strong>extended</strong> dataset may not be the best idea, as some classes remain significantly less represented than the others there. Training a model with this dataset would make it biased towards predicting overrepresented classes. However, in our case we are trying to score highest accuracy on supplied test dataset, which (probably) follows the same classes distribution. So we are going to <em>cheat</em> a bit and use this extended dataset for pre-training — this has proven to make test set accuracy higher (although hardly makes a model perform better “in the field”!).</p>
<p>I then use 25% of these augmented datasets for validation while training in 2 stages:</p>
<ul>
<li><strong>Stage 1: Pre-training</strong>. On the first stage I pre-train the model using the <strong>extended</strong> training dataset with TensorFlow’s <code class="language-plaintext highlighter-rouge">AdamOptimizer</code> and the learning rate set to <strong>0.001</strong>. It normally stops improving after ~180 epochs, which takes ~3.5 hours on <a href="http://navoshta.com/meet-fenton/">my machine</a> equipped with an Nvidia GTX 1080 GPU.</li>
<li><strong>Stage 2: Fine-tuning</strong>. I then train the model using the <strong>balanced</strong> dataset with a decreased learning rate of <strong>0.0001</strong> (see the sketch below for how both stages can share the same graph).</li>
</ul>
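<p>One way for both stages to share the same graph and weights is to feed the learning rate at run time. A minimal sketch, assuming <code class="language-plaintext highlighter-rouge">loss</code> is the total model loss defined elsewhere:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tensorflow as tf

# The learning rate is a placeholder, fed alongside the training batches.
tf_learning_rate = tf.placeholder(tf.float32)
optimizer = tf.train.AdamOptimizer(learning_rate = tf_learning_rate).minimize(loss)

# Stage 1: pre-training on the extended dataset.
#   session.run(optimizer, feed_dict = {..., tf_learning_rate: 0.001})
# Stage 2: fine-tuning on the balanced dataset.
#   session.run(optimizer, feed_dict = {..., tf_learning_rate: 0.0001})
</code></pre></div></div>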
<p>These two training stages could easily get you past 99% accuracy on the test set. You can, however, improve model performance even further by re-generating <strong>balanced</strong> dataset with slightly decreased augmentation intensity and repeating 2nd fine-tuning stage a couple of times.</p>
<h2 id="visualization">Visualization</h2>
<p>As an illustration of what a trained neural network looks like, let’s plot the weights of the first convolutional layer. The first layer has dimensions of <code class="language-plaintext highlighter-rouge">5×5×1×32</code>, which means that it consists of <strong>32 5×5 filters</strong> — we can visualize them as 32 grayscale images of 5×5 px each.</p>
<table border="">
<caption><b>5×5 convolutional filters of the first layer</b></caption>
<tr>
<td><img src="/images/posts/traffic-signs-classification/conv1_weights_raw.png" alt="Raw" /></td>
<td><img src="/images/posts/traffic-signs-classification/conv1_weights_interpolated.png" alt="Interpolated" /></td>
</tr>
<tr>
<td align="center">Raw</td>
<td align="center">Interpolated</td>
</tr>
</table>
<p>We usually expect the first layer to contain filters that can detect very basic pixel patterns, like edges and lines. These basic filters are then used by subsequent layers as building blocks to construct detectors of more complicated patterns and figures.</p>
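<p>Here is a sketch of how these filters could be fetched from a trained <code class="language-plaintext highlighter-rouge">session</code> and plotted, assuming the <code class="language-plaintext highlighter-rouge">conv1</code> variable scope from the model definition above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt
import tensorflow as tf

# Fetch the first layer's kernel tensor from the trained graph.
with tf.variable_scope('conv1', reuse = True):
    conv1_weights = tf.get_variable('weights')
kernels = session.run(conv1_weights)  # shape: (5, 5, 1, 32)

# Plot each of the 32 filters as a 5x5 grayscale image.
fig, axes = plt.subplots(4, 8, figsize = (8, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(kernels[:, :, 0, i], cmap = 'gray')
    ax.axis('off')
plt.show()
</code></pre></div></div>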
<h2 id="results">Results</h2>
<p>After a couple of fine-tuning training iterations this model scored <strong>99.33% accuracy on the test set</strong>, which is not too bad. As there was a total of 12,630 images that we used for testing, that means there are <strong>85 examples</strong> that the model could not classify correctly — let’s take a look at those bad boys!</p>
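<p>One way to pull those examples out is to compare predicted and ground-truth labels. A sketch, where <code class="language-plaintext highlighter-rouge">predictions</code> stands for the model’s logits over the test set:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

predicted_classes = np.argmax(predictions, axis = 1)
misclassified = np.where(predicted_classes != y_test)[0]
print('Misclassified examples: %d' % misclassified.shape[0])
</code></pre></div></div>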
<table border="">
<caption><b>Remaining 85 errors out of 12,630 samples of the test set</b></caption>
<tr>
<td><img src="/images/posts/traffic-signs-classification/8DKqcJ3Ir9U8IAAAAASUVORK5CYII=.png" alt="Original" /></td>
<td><img src="/images/posts/traffic-signs-classification/L+aiejvF2sYAAAAASUVORK5CYII=.png" alt="Preprocessed" /></td>
</tr>
<tr>
<td align="center">Original</td>
<td align="center">Preprocessed</td>
</tr>
</table>
<p>Signs in most of these images have artefacts like shadows or obstructing objects. There are, however, a couple of signs that were simply underrepresented in the training set — training solely on balanced datasets could potentially eliminate this issue, and using some sort of color information could definitely help as well.</p>
<p>Finally, this model provides mildly interesting predictions for types of signs it wasn’t trained for.</p>
<p style="text-align: center;" class="small"><img src="/images/posts/traffic-signs-classification/elderly_sign_prediction.png" alt="image-center" class="align-center" />
Predictions for a new type of sign</p>
<p>To clarify, this <strong>Elderly crossing</strong> sign was not among the 43 classes this model was trained on, yet what we see here is a reasonable assumption that it looks a lot like the <strong>Road narrows on the right</strong> sign. Ironically, the classifier’s second guess was that this <strong>Elderly crossing</strong> sign should be classified as <strong>Children crossing</strong>!</p>
<p>In conclusion, according to different sources human performance on a similar task varies from 98.3% to 98.8%, therefore this model seems to outperform an average human, which, I believe, is the ultimate goal of machine learning!</p>
<p><a class="github-button" href="https://github.com/alexstaravoitau" data-style="mega" data-count-href="/navoshta/followers" data-count-api="/users/navoshta#followers" data-count-aria-label="# followers on GitHub" aria-label="Follow @alexstaravoitau on GitHub">Follow @alexstaravoitau</a>
<a class="github-button" href="https://github.com/alexstaravoitau/traffic-signs" data-icon="octicon-star" data-style="mega" data-count-href="/navoshta/traffic-signs/stargazers" data-count-api="/repos/navoshta/traffic-signs#stargazers_count" data-count-aria-label="# stargazers on GitHub" aria-label="Star navoshta/traffic-signs on GitHub">Star</a>
<a class="github-button" href="https://github.com/alexstaravoitau/traffic-signs/fork" data-icon="octicon-repo-forked" data-style="mega" data-count-href="/navoshta/traffic-signs/network" data-count-api="/repos/navoshta/traffic-signs#forks_count" data-count-aria-label="# forks on GitHub" aria-label="Fork navoshta/traffic-signs on GitHub">Fork</a>
<a class="github-button" href="https://github.com/alexstaravoitau/traffic-signs/archive/master.zip" data-icon="octicon-cloud-download" data-style="mega" aria-label="Download navoshta/traffic-signs on GitHub">Download</a></p>
<script async="" defer="" src="https://buttons.github.io/buttons.js"></script>Alex StaravoitauThis is my attempt to tackle traffic signs classification problem with a convolutional neural network implemented in TensorFlow (reaching **99.33%** accuracy). The highlights of this solution would be data preprocessing, data augmentation, pre-training and skipping connections in the network.Detecting facial keypoints with TensorFlow2017-01-09T00:00:00+00:002017-01-09T00:00:00+00:00//navoshta.com/facial-with-tensorflow<p>This is a TensorFlow follow-along for an amazing <a href="http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/">Deep Learning tutorial</a> by Daniel Nouri. Daniel describes ways of approaching a computer vision problem of detecting facial keypoints in an image using various deep learning techniques, while these techniques gradually build upon each other, demonstrating advantages and limitations of each. <!--more--> I highly recommend going through the steps if you are interested in the topic and prefer learning by example.</p>
<p>However, Daniel uses Lasagne as a machine learning framework, and I’m currently learning to use TensorFlow, so I thought I would publish my follow-along tutorial where I’m utilising the very same approach, but using TensorFlow to build the models at each of the steps. Daniel is using a set of different models that tend to gradually get more complicated (and perform better), so I did the same and broke down the tutorial into three Jupyter notebooks:</p>
<ul>
<li><strong>First model: a single hidden layer.</strong> A very simple neural network.</li>
<li><strong>Second model: convolutions.</strong> Convolutional neural network with data augmentation, learning rate decay and dropout.</li>
<li><strong>Third model: training specialists.</strong> A pipeline of specialist CNNs with early stopping and supervised pre-training.</li>
</ul>
<p>Let’s take a look at them and check out the differences when it comes to TensorFlow. You can get the notebooks here:</p>
<p><a class="github-button" href="https://github.com/alexstaravoitau/kaggle-facial-keypoints-detection" data-icon="octicon-star" data-style="mega" data-count-href="/navoshta/kaggle-facial-keypoints-detection/stargazers" data-count-api="/repos/navoshta/kaggle-facial-keypoints-detection#stargazers_count" data-count-aria-label="# stargazers on GitHub" aria-label="Star navoshta/kaggle-facial-keypoints-detection on GitHub">Star</a>
<a class="github-button" href="https://github.com/alexstaravoitau/kaggle-facial-keypoints-detection/fork" data-icon="octicon-repo-forked" data-style="mega" data-count-href="/navoshta/kaggle-facial-keypoints-detection/network" data-count-api="/repos/navoshta/kaggle-facial-keypoints-detection#forks_count" data-count-aria-label="# forks on GitHub" aria-label="Fork navoshta/kaggle-facial-keypoints-detection on GitHub">Fork</a>
<a class="github-button" href="https://github.com/alexstaravoitau/kaggle-facial-keypoints-detection/archive/master.zip" data-icon="octicon-cloud-download" data-style="mega" aria-label="Download navoshta/kaggle-facial-keypoints-detection on GitHub">Download</a></p>
<script async="" defer="" src="https://buttons.github.io/buttons.js"></script>
<h2 id="first-model-a-single-hidden-layer">First model: a single hidden layer.</h2>
<p>This is a fairly simple model, so it was easy to recreate it in TensorFlow. If you are not familiar with the TensorFlow framework, here is how it works: you first build a computation graph, which means you specify all the variables you are planning to use, as well as all the relations across those variables. Then you evaluate specific variables from that graph that you are interested in, triggering computation of the path in the graph that leads to them. So in our case we will define a neural network structure and its loss, and will then train it by evaluating a TensorFlow loss optimiser, feeding it with batches of training data over and over again.</p>
<p>First, let’s introduce a couple of handy functions that will help us define the model architecture.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">fully_connected</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="n">weights</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">get_variable</span><span class="p">(</span> <span class="s">'weights'</span><span class="p">,</span>
<span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="nb">input</span><span class="p">.</span><span class="n">get_shape</span><span class="p">()[</span><span class="mi">1</span><span class="p">],</span> <span class="n">size</span><span class="p">],</span>
<span class="n">initializer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">contrib</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">xavier_initializer</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">biases</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">get_variable</span><span class="p">(</span> <span class="s">'biases'</span><span class="p">,</span>
<span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="n">size</span><span class="p">],</span>
<span class="n">initializer</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">constant_initializer</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">weights</span><span class="p">)</span> <span class="o">+</span> <span class="n">biases</span>
</code></pre></div></div>
<p>This function performs a single fully connected neural network layer pass. You only need to provide the input and define the number of units; it will work out the rest and initialise its weights. It’s very handy, since now we can use the same function for defining as many fully connected layers as we like. Let’s define our model structure and use this function for defining the hidden and output layers:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model_pass</span><span class="p">(</span><span class="nb">input</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'hidden'</span><span class="p">):</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="n">fully_connected</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">relu_hidden</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">hidden</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'out'</span><span class="p">):</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">fully_connected</span><span class="p">(</span><span class="n">relu_hidden</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">num_keypoints</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
</code></pre></div></div>
<p>This function performs a full model pass. It takes our array of features, passes it over to the hidden layer (containing 100 units), then feeds the hidden output to the output layer, which in its turn produces a vector of output values.</p>
<p>Please note that we used the <code class="language-plaintext highlighter-rouge">fully_connected()</code> function defined earlier for both layers, and thanks to TensorFlow’s concept of <code class="language-plaintext highlighter-rouge">variable_scope</code> we didn’t have to specify variables for the weights and biases of each. You can think of it this way: in this example we implicitly create variables with the following names:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">hidden/weights</code></li>
<li><code class="language-plaintext highlighter-rouge">hidden/bias</code></li>
<li><code class="language-plaintext highlighter-rouge">out/weights</code></li>
<li><code class="language-plaintext highlighter-rouge">out/bias</code></li>
</ul>
<p>You don’t have to use the full names of those variables each time; instead you simply specify a block with a <em>variable scope</em> — and whenever you try to get hold of a variable using <code class="language-plaintext highlighter-rouge">tf.get_variable()</code> within that block, the scope name is prepended to the variable’s name.</p>
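<p>A minimal illustration of this behaviour (the shape and initializer here are arbitrary):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tensorflow as tf

with tf.variable_scope('hidden'):
    weights = tf.get_variable('weights', shape = [100, 100],
                              initializer = tf.contrib.layers.xavier_initializer())
print(weights.name)  # prints "hidden/weights:0"
</code></pre></div></div>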
<p>Ok, now let’s define our training graph. First, just as we did for each of the layers, we will use a variable scope for the whole model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># This model has 1 fully connected layer, we train it using batches of 36 examples for 1000 epochs.
</span><span class="n">model_variable_scope</span> <span class="o">=</span> <span class="s">"1fc_b36_e1000"</span>
</code></pre></div></div>
<p>So our variables would now have names like <code class="language-plaintext highlighter-rouge">1fc_b36_e1000/hidden/weights</code>, <code class="language-plaintext highlighter-rouge">1fc_b36_e1000/hidden/biases</code> and so on.</p>
<p>Next, we initialise a graph.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">graph</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">Graph</span><span class="p">()</span>
<span class="k">with</span> <span class="n">graph</span><span class="p">.</span><span class="n">as_default</span><span class="p">():</span>
<span class="p">...</span>
</code></pre></div></div>
<p class="notice">Strictly speaking we didn’t have to do that, as there is always a default graph and we could just use it. But where is fun in that?</p>
<p>Whatever comes in <code class="language-plaintext highlighter-rouge">with graph.as_default():</code> block defines our graph: all of the graph variables and their relations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">graph</span><span class="p">.</span><span class="n">as_default</span><span class="p">():</span>
<span class="c1"># Input data. For the training data, we use a placeholder that will be fed at run time with a training minibatch.
</span> <span class="n">tf_x_batch</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span> <span class="o">=</span> <span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">image_size</span> <span class="o">*</span> <span class="n">image_size</span><span class="p">))</span>
<span class="n">tf_y_batch</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span> <span class="o">=</span> <span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">num_keypoints</span><span class="p">))</span>
<span class="c1"># Training computation.
</span> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="n">model_variable_scope</span><span class="p">):</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">model_pass</span><span class="p">(</span><span class="n">tf_x_batch</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">predictions</span> <span class="o">-</span> <span class="n">tf_y_batch</span><span class="p">))</span>
<span class="c1"># Optimizer.
</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">train</span><span class="p">.</span><span class="n">MomentumOptimizer</span><span class="p">(</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="n">learning_rate</span><span class="p">,</span>
<span class="n">momentum</span> <span class="o">=</span> <span class="n">momentum</span><span class="p">,</span>
<span class="n">use_nesterov</span> <span class="o">=</span> <span class="bp">True</span>
<span class="p">).</span><span class="n">minimize</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
</code></pre></div></div>
<p>Here we define a couple of <code class="language-plaintext highlighter-rouge">tf.placeholder</code>s — these are not variables per se, they are, well, just placeholders. They are not trainable, and we don’t need to initialise them during graph build time. Instead, we provide what’s going to be in them during run time, while evaluating portions of our graph. Here we will use them to feed model with training examples in batches, and those examples will, of course, change after every weights update. <strong>Note that we don’t explicitly specify batch size during graph build time, and instead use <code class="language-plaintext highlighter-rouge">None</code> as the first dimension of placeholders’ shapes.</strong></p>
<p>We then define computation of model predictions and loss, create an optimiser for our model and off we go!</p>
<p>Now we need to run that graph using <code class="language-plaintext highlighter-rouge">tf.Session</code> object. Every session has a graph, so we specify one when initialising our session. Also, before doing any computation you need to initialise all graph variables by running <code class="language-plaintext highlighter-rouge">tf.global_variables_initializer()</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">Session</span><span class="p">(</span><span class="n">graph</span> <span class="o">=</span> <span class="n">graph</span><span class="p">)</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="c1"># Initialise all variables in the graph
</span> <span class="n">session</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">global_variables_initializer</span><span class="p">())</span>
<span class="p">...</span>
</code></pre></div></div>
<p>Once we are in the scope of initialised session we can actually perform the training procedure:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
<span class="c1"># Train on whole randomised dataset in batches
</span> <span class="n">batch_iterator</span> <span class="o">=</span> <span class="n">BatchIterator</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">=</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">x_batch</span><span class="p">,</span> <span class="n">y_batch</span> <span class="ow">in</span> <span class="n">batch_iterator</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">):</span>
<span class="n">session</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="n">optimizer</span><span class="p">],</span> <span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">tf_x_batch</span> <span class="p">:</span> <span class="n">x_batch</span><span class="p">,</span>
<span class="n">tf_y_batch</span> <span class="p">:</span> <span class="n">y_batch</span>
<span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div>
<p>What happens here is that we ask the session to evaluate <code class="language-plaintext highlighter-rouge">optimizer</code>, which will implicitly run a sub-graph containing every variable that <code class="language-plaintext highlighter-rouge">optimizer</code> uses. From definition you can see it uses <code class="language-plaintext highlighter-rouge">loss</code> (value it is optimising), which in its turn uses <code class="language-plaintext highlighter-rouge">predictions</code>, etc. We also provide values that should be put in our data feeding <code class="language-plaintext highlighter-rouge">tf.placeholder</code>s by providing <code class="language-plaintext highlighter-rouge">feed_dict</code> parameter. This means that by the time computation of the path leading to <code class="language-plaintext highlighter-rouge">optimizer</code> begins, <code class="language-plaintext highlighter-rouge">tf_x_batch</code> and <code class="language-plaintext highlighter-rouge">tf_y_batch</code> placeholders would be holding <code class="language-plaintext highlighter-rouge">x_batch</code> and <code class="language-plaintext highlighter-rouge">y_batch</code> values respectively.</p>
<p>When training finishes we need to run our trained model on the testing data. We do this within the scope of the same <code class="language-plaintext highlighter-rouge">session</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Evaluate on test dataset (also in batches).
</span><span class="n">batch_iterator</span> <span class="o">=</span> <span class="n">BatchIterator</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">128</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">x_batch</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">batch_iterator</span><span class="p">(</span><span class="n">x_test</span><span class="p">):</span>
<span class="p">[</span><span class="n">p_batch</span><span class="p">]</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="n">predictions</span><span class="p">],</span> <span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">tf_x_batch</span> <span class="p">:</span> <span class="n">x_batch</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="n">predictions</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">p_batch</span><span class="p">)</span>
<span class="n">test_loss</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">predictions</span> <span class="o">-</span> <span class="n">y_test</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">" Test score: %.3f (loss = %.8f)"</span> <span class="o">%</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">test_loss</span><span class="p">)</span> <span class="o">*</span> <span class="mf">48.0</span><span class="p">,</span> <span class="n">test_loss</span><span class="p">))</span>
</code></pre></div></div>
<p>Since this operation would be performed on the GPU by default (if you’re running a GPU build of TensorFlow), you may bump into your GPU’s memory limitations, which is why I suggest batching your testing data as well.</p>
<p class="notice">I’m still using a tiny bit of Lasagne here, more specifically its <code class="language-plaintext highlighter-rouge">BatchIterator</code>. Further in tutorial Daniel uses this <code class="language-plaintext highlighter-rouge">BatchIterator</code> for data augmentation and it fits perfectly into the workflow. Also, as far as I’m aware TensorFlow lacks a similar plug-and-play component for iterating over data in batches, and one would have to define their own <code class="language-plaintext highlighter-rouge">tf.train.Example</code> type and setup a pipeline for <code class="language-plaintext highlighter-rouge">tf.TFRecordReader</code>, feeding it to the model with a <code class="language-plaintext highlighter-rouge">tf.train.QueueRunner</code>. Although this seems like a bucket of joy, I thought I would go with a plain vanilla <code class="language-plaintext highlighter-rouge">BatchIterator</code>, and concentrate on building a model instead. Data feeding in TensorFlow seems to be a broad topic, and would make a good article on its own!</p>
<p>As you can see, we only supply the <code class="language-plaintext highlighter-rouge">tf_x_batch</code> value in the <code class="language-plaintext highlighter-rouge">feed_dict</code>, since we are only evaluating the <code class="language-plaintext highlighter-rouge">predictions</code> variable here, and its path in the graph does not involve <code class="language-plaintext highlighter-rouge">tf_y_batch</code>: we are not calculating <code class="language-plaintext highlighter-rouge">loss</code> as part of this computation, after all.</p>
<p>One of the neat Lasagne features is keeping track of training history by logging validation and training losses. As far as I’m aware, TensorFlow doesn’t do that for you, so we will have to come up with some other solution.</p>
<p class="notice">One might be tempted to use <code class="language-plaintext highlighter-rouge">tf.train.SummaryWriter</code>s and visualise data using <code class="language-plaintext highlighter-rouge">TensorBoard</code>, and actually that’s exactly what I did at first. I even managed to plot training and validation losses on the same graph and overcome a couple of other issues, but in the end <code class="language-plaintext highlighter-rouge">tf.train.SummaryWriter</code> seemed to slow down training process quite a bit. I’m not sure if it was due to me not using it correctly, or it’s just the way it works, but I got much better results in terms of speed using simple arrays, saving them to disk and plotting losses with <code class="language-plaintext highlighter-rouge">matplotlib</code>.</p>
<p>First, let’s refactor the part where we evaluate the model on the testing dataset into a function. We’re going to use it quite a lot, as the plan is to periodically get predictions for the validation and training datasets during training and log their losses:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_predictions_in_batches</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">session</span><span class="p">):</span>
<span class="s">"""
Calculates predictions in batches of 128 examples at a time, using `session`'s calculation graph.
"""</span>
<span class="n">p</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">batch_iterator</span> <span class="o">=</span> <span class="n">BatchIterator</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">128</span><span class="p">)</span>
<span class="k">for</span> <span class="n">x_batch</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">batch_iterator</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
<span class="p">[</span><span class="n">p_batch</span><span class="p">]</span> <span class="o">=</span> <span class="n">session</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="n">predictions</span><span class="p">],</span> <span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">tf_x_batch</span> <span class="p">:</span> <span class="n">x_batch</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="n">p</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">p_batch</span><span class="p">)</span>
<span class="k">return</span> <span class="n">p</span>
</code></pre></div></div>
<p>This function is simply a convenient way of getting predictions for a dataset using the weights the model has learned so far.</p>
<p>Now let’s add a couple of arrays: <code class="language-plaintext highlighter-rouge">train_loss_history</code> and <code class="language-plaintext highlighter-rouge">valid_loss_history</code> for keeping track of training and validation losses respectively. Let’s rewrite our training code as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">calc_loss</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
<span class="s">"""
Mean squared error for given predictions.
"""</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">predictions</span> <span class="o">-</span> <span class="n">labels</span><span class="p">))</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">Session</span><span class="p">(</span><span class="n">graph</span> <span class="o">=</span> <span class="n">graph</span><span class="p">)</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="n">tf</span><span class="p">.</span><span class="n">initialize_all_variables</span><span class="p">().</span><span class="n">run</span><span class="p">()</span>
<span class="n">train_loss_history</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">)</span>
<span class="n">valid_loss_history</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"============ TRAINING ============="</span><span class="p">)</span>
<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
<span class="c1"># Train on whole randomised dataset in batches
</span> <span class="n">batch_iterator</span> <span class="o">=</span> <span class="n">BatchIterator</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">=</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">x_batch</span><span class="p">,</span> <span class="n">y_batch</span> <span class="ow">in</span> <span class="n">batch_iterator</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">):</span>
<span class="n">session</span><span class="p">.</span><span class="n">run</span><span class="p">([</span><span class="n">optimizer</span><span class="p">],</span> <span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">tf_x_batch</span> <span class="p">:</span> <span class="n">x_batch</span><span class="p">,</span>
<span class="n">tf_y_batch</span> <span class="p">:</span> <span class="n">y_batch</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="c1"># Another epoch ended, let's log our losses.
</span> <span class="c1"># Get training data predictions and log training loss:
</span> <span class="n">train_loss</span> <span class="o">=</span> <span class="n">calc_loss</span><span class="p">(</span>
<span class="n">get_predictions_in_batches</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">session</span><span class="p">),</span>
<span class="n">y_train</span>
<span class="p">)</span>
<span class="n">train_loss_history</span><span class="p">[</span><span class="n">epoch</span><span class="p">]</span> <span class="o">=</span> <span class="n">train_loss</span>
<span class="c1"># Get validation data predictions and log validation loss:
</span> <span class="n">valid_loss</span> <span class="o">=</span> <span class="n">calc_loss</span><span class="p">(</span>
<span class="n">get_predictions_in_batches</span><span class="p">(</span><span class="n">x_valid</span><span class="p">,</span> <span class="n">session</span><span class="p">),</span>
<span class="n">y_valid</span>
<span class="p">)</span>
<span class="n">valid_loss_history</span><span class="p">[</span><span class="n">epoch</span><span class="p">]</span> <span class="o">=</span> <span class="n">valid_loss</span>
<span class="k">if</span> <span class="p">(</span><span class="n">epoch</span> <span class="o">%</span> <span class="mi">100</span> <span class="o">==</span> <span class="mi">0</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">"--------- EPOCH %4d/%d ---------"</span> <span class="o">%</span> <span class="p">(</span><span class="n">epoch</span><span class="p">,</span> <span class="n">num_epochs</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">" Train loss: %.8f"</span> <span class="o">%</span> <span class="p">(</span><span class="n">train_loss</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Validation loss: %.8f"</span> <span class="o">%</span> <span class="p">(</span><span class="n">valid_loss</span><span class="p">))</span>
<span class="c1"># Evaluate on test dataset.
</span> <span class="n">test_loss</span> <span class="o">=</span> <span class="n">calc_loss</span><span class="p">(</span>
<span class="n">get_predictions_in_batches</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">session</span><span class="p">),</span>
<span class="n">y_test</span>
<span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"==================================="</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">" Test score: %.3f (loss = %.8f)"</span> <span class="o">%</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">test_loss</span><span class="p">)</span> <span class="o">*</span> <span class="mf">48.0</span><span class="p">,</span> <span class="n">test_loss</span><span class="p">))</span>
<span class="n">np</span><span class="p">.</span><span class="n">savez</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">getcwd</span><span class="p">()</span> <span class="o">+</span> <span class="s">"/train_history"</span><span class="p">,</span> <span class="n">train_loss_history</span> <span class="o">=</span> <span class="n">train_loss_history</span><span class="p">,</span> <span class="n">valid_loss_history</span> <span class="o">=</span> <span class="n">valid_loss_history</span><span class="p">)</span>
</code></pre></div></div>
<p>You can now load the training history from the file and use Daniel’s code to plot learning curves and see how your model is performing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_history</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">getcwd</span><span class="p">()</span> <span class="o">+</span> <span class="s">"/train_history.npz"</span><span class="p">)</span>
<span class="n">train_loss</span> <span class="o">=</span> <span class="n">model_history</span><span class="p">[</span><span class="s">"train_loss_history"</span><span class="p">]</span>
<span class="n">valid_loss</span> <span class="o">=</span> <span class="n">model_history</span><span class="p">[</span><span class="s">"valid_loss_history"</span><span class="p">]</span>
<span class="n">x_axis</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_axis</span><span class="p">,</span> <span class="n">train_loss</span><span class="p">,</span> <span class="s">"b-"</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"train"</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_axis</span><span class="p">,</span> <span class="n">valid_loss</span><span class="p">,</span> <span class="s">"g-"</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"valid"</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">grid</span><span class="p">()</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">"epoch"</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"loss"</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">ylim</span><span class="p">(</span><span class="mf">0.0005</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">num_epochs</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">yscale</span><span class="p">(</span><span class="s">"log"</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p class="notice">You may want to only log losses every, say, 5 or 10 epochs, as evaluating on the whole training set does take a while. However, you may need validation loss later on in order to implement early stopping.</p>
<h2 id="second-model-convolutions">Second model: convolutions.</h2>
<p>In the second model we will add convolutions, which should improve model performance significantly. Let’s declare a couple of additional convenience functions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">conv_relu</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">depth</span><span class="p">):</span>
<span class="n">weights</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">get_variable</span><span class="p">(</span> <span class="s">'weights'</span><span class="p">,</span>
<span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="n">kernel_size</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="nb">input</span><span class="p">.</span><span class="n">get_shape</span><span class="p">()[</span><span class="mi">3</span><span class="p">],</span> <span class="n">depth</span><span class="p">],</span>
<span class="n">initializer</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">contrib</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">xavier_initializer</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">biases</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">get_variable</span><span class="p">(</span> <span class="s">'biases'</span><span class="p">,</span>
<span class="n">shape</span> <span class="o">=</span> <span class="p">[</span><span class="n">depth</span><span class="p">],</span>
<span class="n">initializer</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">constant_initializer</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">conv</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">conv2d</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">weights</span><span class="p">,</span>
<span class="n">strides</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">conv</span> <span class="o">+</span> <span class="n">biases</span><span class="p">)</span>
</code></pre></div></div>
<p>This one performs a convolutional layer pass followed by a rectified linear unit (since those two are usually applied together). As you can see, we’re using <code class="language-plaintext highlighter-rouge">tf.get_variable()</code> again, so we can reuse this function for different layers by simply providing a variable scope. Let’s add a couple of other helper functions to make encoding our model architecture easier:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">fully_connected_relu</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">fully_connected</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">pool</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">max_pool</span><span class="p">(</span>
<span class="nb">input</span><span class="p">,</span>
<span class="n">ksize</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="n">strides</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="n">padding</span><span class="o">=</span><span class="s">'SAME'</span>
<span class="p">)</span>
</code></pre></div></div>
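<p>Both of these lean on the <code class="language-plaintext highlighter-rouge">fully_connected</code> helper from the first notebook. In case you’re reading this section in isolation, a minimal sketch of such a helper, following the same <code class="language-plaintext highlighter-rouge">tf.get_variable()</code> pattern (and not necessarily matching the first notebook’s exact code), might look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of a `fully_connected` helper in the same spirit.
def fully_connected(input, size):
    weights = tf.get_variable( 'weights',
        shape = [input.get_shape()[1], size],
        initializer = tf.contrib.layers.xavier_initializer()
    )
    biases = tf.get_variable( 'biases',
        shape = [size],
        initializer = tf.constant_initializer(0.0)
    )
    return tf.matmul(input, weights) + biases
</code></pre></div></div>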
<p>Ok, with these routines we can now encode our full model pass.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">model_pass</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">training</span><span class="p">):</span>
<span class="c1"># Convolutional layers
</span> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'conv1'</span><span class="p">):</span>
<span class="n">conv1</span> <span class="o">=</span> <span class="n">conv_relu</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">kernel_size</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="n">depth</span> <span class="o">=</span> <span class="mi">32</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'pool1'</span><span class="p">):</span>
<span class="n">pool1</span> <span class="o">=</span> <span class="n">pool</span><span class="p">(</span><span class="n">conv1</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># Apply dropout if needed
</span> <span class="n">pool1</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">pool1</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="mf">0.9</span> <span class="k">if</span> <span class="n">dropout</span> <span class="k">else</span> <span class="mf">1.0</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">pool1</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'conv2'</span><span class="p">):</span>
<span class="n">conv2</span> <span class="o">=</span> <span class="n">conv_relu</span><span class="p">(</span><span class="n">pool1</span><span class="p">,</span> <span class="n">kernel_size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span> <span class="n">depth</span> <span class="o">=</span> <span class="mi">64</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'pool2'</span><span class="p">):</span>
<span class="n">pool2</span> <span class="o">=</span> <span class="n">pool</span><span class="p">(</span><span class="n">conv2</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># Apply dropout if needed
</span> <span class="n">pool2</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">pool2</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="mf">0.8</span> <span class="k">if</span> <span class="n">dropout</span> <span class="k">else</span> <span class="mf">1.0</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">pool2</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'conv3'</span><span class="p">):</span>
<span class="n">conv3</span> <span class="o">=</span> <span class="n">conv_relu</span><span class="p">(</span><span class="n">pool2</span><span class="p">,</span> <span class="n">kernel_size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span> <span class="n">depth</span> <span class="o">=</span> <span class="mi">128</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'pool3'</span><span class="p">):</span>
<span class="n">pool3</span> <span class="o">=</span> <span class="n">pool</span><span class="p">(</span><span class="n">conv3</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># Apply dropout if needed
</span> <span class="n">pool3</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">pool3</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="mf">0.7</span> <span class="k">if</span> <span class="n">dropout</span> <span class="k">else</span> <span class="mf">1.0</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">pool3</span><span class="p">)</span>
<span class="c1"># Flatten convolutional layers output
</span> <span class="n">shape</span> <span class="o">=</span> <span class="n">pool3</span><span class="p">.</span><span class="n">get_shape</span><span class="p">().</span><span class="n">as_list</span><span class="p">()</span>
<span class="n">flattened</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">pool3</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">*</span> <span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">]])</span>
<span class="c1"># Fully connected layers
</span> <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'fc4'</span><span class="p">):</span>
<span class="n">fc4</span> <span class="o">=</span> <span class="n">fully_connected_relu</span><span class="p">(</span><span class="n">flattened</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">)</span>
<span class="c1"># Apply dropout if needed
</span> <span class="n">fc4</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">fc4</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="k">if</span> <span class="n">dropout</span> <span class="k">else</span> <span class="mf">1.0</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">fc4</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'fc5'</span><span class="p">):</span>
<span class="n">fc5</span> <span class="o">=</span> <span class="n">fully_connected_relu</span><span class="p">(</span><span class="n">fc4</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">variable_scope</span><span class="p">(</span><span class="s">'out'</span><span class="p">):</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">fully_connected</span><span class="p">(</span><span class="n">fc5</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">num_keypoints</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
</code></pre></div></div>
<p>Please note those weird assignments:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fc4</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">tf</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">fc4</span><span class="p">,</span> <span class="n">keep_prob</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="k">if</span> <span class="n">dropout</span> <span class="k">else</span> <span class="mf">1.0</span><span class="p">),</span> <span class="k">lambda</span><span class="p">:</span> <span class="n">fc4</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s break it down a bit.</p>
<p>First, we calculate <code class="language-plaintext highlighter-rouge">0.5 if dropout else 1.0</code>, which means that we only apply dropout if the <code class="language-plaintext highlighter-rouge">dropout</code> flag is set to <code class="language-plaintext highlighter-rouge">True</code>. This is done so that later you can compare how the same model performs with and without dropout.</p>
<p>Furthermore, we only want to apply dropout while training, and not while evaluating our model; that’s why we put the assignment into a <code class="language-plaintext highlighter-rouge">tf.cond(training, lambda: ..., lambda: fc4)</code> block. It means that if <code class="language-plaintext highlighter-rouge">training</code> (a boolean <code class="language-plaintext highlighter-rouge">tf.placeholder</code> we feed into the graph) evaluates to <code class="language-plaintext highlighter-rouge">True</code>, we apply dropout, and simply pass <code class="language-plaintext highlighter-rouge">fc4</code> through unchanged otherwise.</p>
<p>Also note that we have to manually flatten convolutional layers’ output before passing it over to fully connected layers.</p>
<p>A couple of new things you may notice in the TensorFlow graph are the <code class="language-plaintext highlighter-rouge">is_training</code> flag, learning rate decay and momentum increase. The <code class="language-plaintext highlighter-rouge">is_training</code> flag is another TensorFlow placeholder we use to indicate whether we’re training or evaluating; in the latter case the model pipeline function won’t apply dropout. I implemented momentum increase in plain Python, by checking how far we have gone into the maximum number of epochs. As for learning rate decay, there is a nice TensorFlow function for exactly that: <code class="language-plaintext highlighter-rouge">tf.train.exponential_decay()</code>, which takes the number of decay steps and the decay rate.</p>
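<p>To give you an idea, wiring a decaying learning rate into the optimiser could look roughly like this (a sketch where the hyperparameter values and the choice of <code class="language-plaintext highlighter-rouge">MomentumOptimizer</code> are illustrative, not the ones used in the notebook):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch: learning rate decay driven by a global step counter.
global_step = tf.Variable(0, trainable = False)
learning_rate = tf.train.exponential_decay(
    0.01,          # starting learning rate (illustrative value)
    global_step,   # current training step, incremented by the optimizer
    1000,          # decay steps (illustrative value)
    0.96           # decay rate (illustrative value)
)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum = 0.9).minimize(
    loss, global_step = global_step
)
</code></pre></div></div>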
<p>The rest should be familiar from the first notebook. Please mind that some optimisation options are defined as flags (for instance, <code class="language-plaintext highlighter-rouge">data_augmentation</code>, <code class="language-plaintext highlighter-rouge">learning_rate_decay</code>, etc.) which are encoded into the model name. This is done so that you can compare performance with different optimisation techniques applied. Just provide the name of a model as a parameter to the <code class="language-plaintext highlighter-rouge">plot_learning_curves()</code> method, and that model’s learning curves will be drawn on top of the current plot:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">new_model_epochs</span> <span class="o">=</span> <span class="n">plot_learning_curves</span><span class="p">()</span>
<span class="n">old_model_epochs</span> <span class="o">=</span> <span class="n">plot_learning_curves</span><span class="p">(</span><span class="s">"1fc_b36_e1000"</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">ylim</span><span class="p">(</span><span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">new_model_epochs</span><span class="p">,</span> <span class="n">old_model_epochs</span><span class="p">))</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<h2 id="third-model-training-specialists">Third model: training specialists.</h2>
<p>The third notebook implements the most advanced model of this tutorial: training specialists for groups of facial keypoints. It also covers another great technique for battling overfitting: <em>early stopping</em>. The <code class="language-plaintext highlighter-rouge">EarlyStopping</code> class from Daniel’s tutorial requires one crucial modification when working with TensorFlow: in order to save and restore trained weights, we need a reference to the current TensorFlow session and a <code class="language-plaintext highlighter-rouge">tf.train.Saver</code> object. The TensorFlow <code class="language-plaintext highlighter-rouge">Saver</code> does exactly what you would expect it to: it lets you easily save and restore variables from your session’s graph, such as trained weights. You simply call <code class="language-plaintext highlighter-rouge">save()</code> to save the current weights to a checkpoint file, or <code class="language-plaintext highlighter-rouge">restore()</code> to load those weights back into your session’s graph.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">saver</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">session</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">)</span>
<span class="c1"># Weights are now saved in a file located at `checkpoint_path`.
</span>
<span class="n">saver</span><span class="p">.</span><span class="n">restore</span><span class="p">(</span><span class="n">session</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="p">)</span>
<span class="c1"># Saved weights are loaded into corresponding variables again.
</span></code></pre></div></div>
<p>As easy as that! One thing to consider is that when restoring a session, your graph (i.e. variables’ names and relations) is expected to be exactly the same as it was during saving, so that <code class="language-plaintext highlighter-rouge">saver</code> knows which weights to load where. The easiest way to ensure this is to save and restore the graph itself with the <code class="language-plaintext highlighter-rouge">tf.train.export_meta_graph</code> and <code class="language-plaintext highlighter-rouge">tf.train.import_meta_graph</code> functions.</p>
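<p>With saving and restoring in place, the early stopping modification becomes straightforward. Here is a rough sketch of the idea (not Daniel’s exact class; the names and the patience value are illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A rough sketch of early stopping backed by a TensorFlow Saver.
class EarlyStopping(object):
    def __init__(self, saver, session, checkpoint_path, patience = 100):
        self.saver = saver                      # saves/restores the best weights
        self.session = session                  # session owning the graph
        self.checkpoint_path = checkpoint_path  # where to keep the best weights
        self.patience = patience
        self.best_valid = np.inf
        self.best_valid_epoch = 0

    def __call__(self, valid_loss, epoch):
        if self.best_valid > valid_loss:
            # New best result: remember it and checkpoint the current weights.
            self.best_valid = valid_loss
            self.best_valid_epoch = epoch
            self.saver.save(self.session, self.checkpoint_path)
        elif epoch > self.best_valid_epoch + self.patience:
            # Ran out of patience: roll back to the best weights and stop.
            self.saver.restore(self.session, self.checkpoint_path)
            return True
        return False
</code></pre></div></div>
<p>Note that this, too, relies on the restored graph being identical to the one that was saved.</p>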
<p>However, what if your graph is <em>not</em> the same? Well, we run into exactly this problem when reusing a previously trained model as a specialist. The idea is that we initialise the weights of each specialist with values from a pre-trained model (the one we implemented in notebook #2, <em>3con_2fc_b36_e1000_aug_lrdec_mominc_dr</em>). Unfortunately, the graph for a single specialist is not going to be the same, due to a different shape of the <code class="language-plaintext highlighter-rouge">out</code> layer, i.e. the number of keypoints the model provides as output. We are also using a different variable scope. In order to fix that, we do the following:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">spec_var_scope</span> <span class="o">=</span> <span class="s">"specialist_variable_scope"</span>
<span class="n">initialising_model</span> <span class="o">=</span> <span class="s">"3con_2fc_b36_e1000_aug_lrdec_mominc_dr"</span>
<span class="c1"># Exclude output layer weights from variables we will restore
</span><span class="n">variables_to_restore</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">tf</span><span class="p">.</span><span class="n">global_variables</span><span class="p">()</span> <span class="k">if</span> <span class="s">"/out/"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">v</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">name</span><span class="p">]</span>
<span class="c1"># Replace variables scope with that of the current model
</span><span class="n">loader</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">train</span><span class="p">.</span><span class="n">Saver</span><span class="p">({</span><span class="n">v</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="n">spec_var_scope</span><span class="p">,</span> <span class="n">initialising_model</span><span class="p">):</span> <span class="n">v</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">variables_to_restore</span><span class="p">})</span>
<span class="n">loader</span><span class="p">.</span><span class="n">restore</span><span class="p">(</span><span class="n">session</span><span class="p">,</span> <span class="s">"/3con_2fc_b36_e1000_aug_lrdec_mominc_dr/model.ckpt"</span><span class="p">)</span>
</code></pre></div></div>
<p>By default <code class="language-plaintext highlighter-rouge">tf.train.Saver</code> will restore all variables you have in your graph. However, you can provide a list of variables to be restored; that’s how we are going to exclude the output layer weights from the list of values we are restoring. An important thing to remember is that a variable scope is essentially a namespace encoded in the variable’s name, with levels separated by slashes (<code class="language-plaintext highlighter-rouge">/</code>). That’s why we can simply filter out variables with <code class="language-plaintext highlighter-rouge">/out/</code> in their names from all variables in the graph:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">variables_to_restore</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">tf</span><span class="p">.</span><span class="n">all_variables</span><span class="p">()</span> <span class="k">if</span> <span class="s">"/out/"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">v</span><span class="p">.</span><span class="n">op</span><span class="p">.</span><span class="n">name</span><span class="p">]</span>
</code></pre></div></div>
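<p>If you’re not sure what those names actually look like, a quick sanity check is to print them:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Quick sanity check: print every variable name in the graph.
for v in tf.global_variables():
    print(v.op.name)
</code></pre></div></div>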
<p>The next thing to do is updating the variable scope. This is done, again, through the variables’ names: we build a dictionary that maps each variable’s name in the checkpoint (obtained by replacing the current specialist scope with the old model name) to the corresponding variable in the current graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...{v.op.name.replace(spec_var_scope, initialising_model): v for v in variables_to_restore}
</code></pre></div></div>
<p>Let’s assume this is what your saved model looks like; say, its graph contains the following variables:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">3con_2fc_b36_e1000_aug_lrdec_mominc_dr/fc4/weights</code></li>
<li><code class="language-plaintext highlighter-rouge">3con_2fc_b36_e1000_aug_lrdec_mominc_dr/fc4/biases</code></li>
<li><code class="language-plaintext highlighter-rouge">3con_2fc_b36_e1000_aug_lrdec_mominc_dr/out/weights</code></li>
<li><code class="language-plaintext highlighter-rouge">3con_2fc_b36_e1000_aug_lrdec_mominc_dr/out/biases</code></li>
</ul>
<p>After we apply our transformations, the list of variables we restore is converted to:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">specialist_variable_scope/fc4/weights</code></li>
<li><code class="language-plaintext highlighter-rouge">specialist_variable_scope/fc4/biases</code></li>
</ul>
<p>And <code class="language-plaintext highlighter-rouge">.../out/weights</code> and <code class="language-plaintext highlighter-rouge">.../out/biases</code> are gone, since they had <code class="language-plaintext highlighter-rouge">/out/</code> in their names.</p>
<p>You can now plot learning curves for each of the specialists and, as Daniel suggests, explore ways of improving your model even further. As he points out, some specialists overfit more than others, so it might make sense to use different dropout values for each of them. One might also want to experiment with additional regularisation techniques, like L2 loss, and probably take some further steps with data augmentation.</p>
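<p>For instance, a simple L2 weight decay term could be added to the loss like this (a sketch with an illustrative coefficient, assuming the mean squared error loss we’ve been using throughout):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch: L2 weight decay over all `weights` variables.
l2_coefficient = 1e-4
l2_penalty = tf.add_n([
    tf.nn.l2_loss(v) for v in tf.trainable_variables()
    if "/weights" in v.op.name
])
loss = tf.reduce_mean(tf.square(predictions - tf_y_batch)) + l2_coefficient * l2_penalty
</code></pre></div></div>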
<p>The full code for this tutorial is available on GitHub: <a href="https://github.com/alexstaravoitau/kaggle-facial-keypoints-detection">alexstaravoitau/kaggle-facial-keypoints-detection</a>.</p>
<script async="" defer="" src="https://buttons.github.io/buttons.js"></script>Alex StaravoitauThis is a TensorFlow follow-along for an amazing Deep Learning tutorial by Daniel Nouri. Daniel describes ways of approaching a computer vision problem of detecting facial keypoints in an image using various deep learning techniques, while these techniques gradually build upon each other, demonstrating advantages and limitations of each.Cloud logger2016-12-25T00:00:00+00:002016-12-25T00:00:00+00:00//navoshta.com/cloud-log<p>Most of the tasks in data science are long-running, and many folks (me included) execute those tasks on <a href="http://navoshta.com/meet-fenton/">remote machines</a>. And the crucial thing for those tasks is logging: you do need to know how training process was going and see the learning curves. It would also be convenient if you could access those logs from anywhere and be notified when the process had finished. So I built the <code class="language-plaintext highlighter-rouge">cloudlog</code>!<!--more--></p>
<h2 id="cloudlog"><code class="language-plaintext highlighter-rouge">cloudlog</code></h2>
<p><code class="language-plaintext highlighter-rouge">cloudlog</code> is a very simple Python logger that duplicates your console logs to a local file, saves a copy safely in the cloud, and can as well notify you via messenger bot. And it can do all those things with <code class="language-plaintext highlighter-rouge">pyplot</code> plots as well! For cloud service I went with <strong>Dropbox</strong>, as it’s easy to integrate and can be accessed from any device such as your phone. For messenger I chose <strong>Telegram</strong>, being a huge fan of the platform.</p>
<h3 id="how-to-use">How to use</h3>
<ul>
<li>Install the package:</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cloudlog
</code></pre></div></div>
<ul>
<li>Import the <code class="language-plaintext highlighter-rouge">CloudLog</code> class:</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">cloudlog</span> <span class="kn">import</span> <span class="n">CloudLog</span>
</code></pre></div></div>
<ul>
<li>Log text by simply calling a <code class="language-plaintext highlighter-rouge">CloudLog</code> instance:</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span> <span class="o">=</span> <span class="n">CloudLog</span><span class="p">(</span><span class="n">root_path</span><span class="o">=</span><span class="s">'~/logs'</span><span class="p">))</span>
<span class="n">log</span><span class="p">(</span><span class="s">'Some important stuff happening.'</span><span class="p">)</span>
<span class="n">log</span><span class="p">(</span><span class="s">'And again!'</span><span class="p">)</span>
<span class="n">log</span><span class="p">(</span><span class="s">'Luckily, it</span><span class="se">\'</span><span class="s">s all safe now in a local file.'</span><span class="p">)</span>
</code></pre></div></div>
<ul>
<li>Add <code class="language-plaintext highlighter-rouge">pyplot</code> plots as images in the same folder:</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span>
<span class="c1"># Draw a plot
</span><span class="n">x</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Amount of logs'</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Coolness of your app'</span><span class="p">)</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Call it before calling `pyplot.show()`.
</span><span class="n">log</span><span class="p">.</span><span class="n">add_plot</span><span class="p">()</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<h3 id="dropbox">Dropbox</h3>
<p>In order to sync your logs and plots to Dropbox, do the following.</p>
<ul>
<li><a href="https://www.dropbox.com/developers/apps/create">Create a Dropbox app</a> with <code class="language-plaintext highlighter-rouge">App folder</code> access type.</li>
<li>Get your Dropbox access token and provide it in the initialiser.</li>
<li>Call <code class="language-plaintext highlighter-rouge">sync()</code> in order to dispatch the log file to your Dropbox app folder.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span> <span class="o">=</span> <span class="n">CloudLog</span><span class="p">(</span><span class="n">root_path</span><span class="o">=</span><span class="s">'~/logs'</span><span class="p">,</span> <span class="n">dropbox_token</span><span class="o">=</span><span class="s">'YOUR_DROPBOX_TOKEN_HERE'</span><span class="p">)</span>
<span class="n">log</span><span class="p">(</span><span class="s">'Some important stuff happening again.'</span><span class="p">)</span>
<span class="n">log</span><span class="p">(</span><span class="s">'Luckily, it</span><span class="se">\'</span><span class="s">s all safe now. In the cloud!'</span><span class="p">)</span>
<span class="n">log</span><span class="p">.</span><span class="n">sync</span><span class="p">()</span>
</code></pre></div></div>
<p>Plots are synced to the Dropbox folder by default.</p>
<h3 id="telegram">Telegram</h3>
<p>You can also get notifications in a Telegram chat, with logs and plots sent to you.</p>
<ul>
<li><a href="https://core.telegram.org/bots#creating-a-new-bot">Create a Telegram bot</a>.</li>
<li>Get your Telegram Bot API access token.</li>
<li><a href="http://stackoverflow.com/a/32777943/300131">Find out your Telegram chat or user ID</a>.</li>
<li>Provide both values in the initialiser.</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">log</span> <span class="o">=</span> <span class="n">CloudLog</span><span class="p">(</span><span class="n">root_path</span><span class="o">=</span><span class="s">'~/logs'</span><span class="p">,</span> <span class="n">telegram_token</span><span class="o">=</span><span class="s">'YOUR_TELEGRAM_TOKEN'</span><span class="p">,</span> <span class="n">telegram_chat_id</span><span class="o">=</span><span class="s">'CHAT_ID'</span><span class="p">)</span>
<span class="n">log</span><span class="p">(</span><span class="s">'Some important stuff once more.'</span><span class="p">)</span>
<span class="n">log</span><span class="p">(</span><span class="s">'Luckily, it</span><span class="se">\'</span><span class="s">s all safe now in a local file. AND you</span><span class="se">\'</span><span class="s">re notified — how cool is that?'</span><span class="p">)</span>
<span class="n">log</span><span class="p">.</span><span class="n">sync</span><span class="p">(</span><span class="n">notify</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="s">'I</span><span class="se">\'</span><span class="s">m pregnant.'</span><span class="p">)</span>
</code></pre></div></div>
<p>Specify the same <code class="language-plaintext highlighter-rouge">notify</code> flag for plots to have them sent to the Telegram chat as well:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="n">log</span><span class="p">.</span><span class="n">add_plot</span><span class="p">(</span><span class="n">notify</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<p>Since one may be tempted to dispatch a bunch of updates at the same time, you will not be notified about messages containing files, such as plots and logs; you are only notified about the <code class="language-plaintext highlighter-rouge">message</code> passed to the <code class="language-plaintext highlighter-rouge">sync()</code> method.</p>
<p>There you go! Your remote machine will now not only safely store your logs in the cloud, providing easy access from anywhere, but will also send you a notification with a full report.</p>
<table border="">
<tr>
<td><img src="/images/posts/cloudlog/cloudlog_screenshot_1.jpg" alt="Messenger notification." /></td>
<td><img src="/images/posts/cloudlog/cloudlog_screenshot_2.jpg" alt="Training report in your chat." /></td>
</tr>
</table>
<p>You could have guessed that <strong>Fenton</strong> is the name of <a href="http://navoshta.com/meet-fenton/">my remote machine</a>, of course!</p>
<p>The <code class="language-plaintext highlighter-rouge">cloudlog</code> source code is available on GitHub: <a href="https://github.com/alexstaravoitau/cloudlog">alexstaravoitau/cloudlog</a>.</p>