Facial keypoint extraction is a challenging problem, and deep learning is a hot and much-explored area. Caffe is a popular deep learning library for training deep networks on large datasets. It is fast, convolutional neural networks perform very well on images, and it can run on both GPU and CPU; running on an Nvidia GPU with CUDA support is almost 20x faster than running on a CPU. Convolutional networks have achieved more than 99% accuracy on MNIST digit classification.
The facial keypoints problem is a classic multi-output regression problem. The input is a (96, 96) pixel grayscale image, and we have to predict 30 outputs (15 x, y keypoint pairs). Kaggle has an interesting facial keypoints competition: https://www.kaggle.com/c/facial-keypoints-detection
Before we move on to Caffe, let's try a simple linear regression model and see how low an MSE (mean squared error) we can get on this problem. We will use the FANN neural network library, written in C, since native C code is much faster than Python on today's CPUs.
What do you need?
Before we proceed further, you should be familiar with:
Python
Pandas
C/C++
Scikit-learn, Scikit-Image
Numpy/Scipy
CUDA from Nvidia
Linux (ubuntu)
Compiling various tools and libraries in Linux/Mac
Basic knowledge of Caffe and Deep Learning
FANN Neural Network Library
Implementing a neural network with FANN is the easiest I have seen. You can find the documentation here: http://leenissen.dk/fann/html/files/fann_train-h.html
All you have to do is dump the image data into a format FANN can read, and the network model will learn from it. We shrink the images by 4x to 24x24 pixels and dump the data for our network to read. We use 100% of the training data without dropping the NaNs; instead we fill them with the mean values.
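For reference, here is a rough Python sketch of what such a dump could look like. This is not the actual prepare_train.py; the file names, the exact scaling, and the crude downsampling are my assumptions, but the plain-text layout (a header line with the number of pairs, inputs and outputs, then alternating input/output lines) is what FANN's training-data reader expects.

import numpy as np
import pandas as pd

df = pd.read_csv('training.csv')
df = df.fillna(df.mean(numeric_only=True))          # fill missing keypoints with column means

# 'Image' is a space-separated string of 96*96 grayscale pixel values
imgs = np.vstack([np.array(s.split(), dtype=np.float32) for s in df['Image']])
imgs = imgs.reshape(-1, 96, 96)[:, ::4, ::4]        # crude 4x downsample to 24x24
imgs = imgs.reshape(len(df), -1) / 255.0            # scale pixels to [0, 1]
labels = df.drop('Image', axis=1).values / 96.0     # scale coordinates to [0, 1]

with open('facial_train.data', 'w') as f:
    # FANN header: <num_pairs> <num_inputs> <num_outputs>
    f.write('%d %d %d\n' % (len(df), imgs.shape[1], labels.shape[1]))
    for x, y in zip(imgs, labels):
        f.write(' '.join('%g' % v for v in x) + '\n')
        f.write(' '.join('%g' % v for v in y) + '\n')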
A simple linear regression using the FANN perceptron model produces an MSE of only about 0.000147 after roughly 1,000 epochs. All I did was connect the 24x24 input images (shrunk 4x) directly to the 30 outputs.
Epochs 980. Current error: 0.0001471710. Bit fail 0.
Epochs 990. Current error: 0.0001470396. Bit fail 0.
Epochs 1000. Current error: 0.0001469008. Bit fail 0.
As you can see, the real training error is 0.000146 * 96 * 4 = 0.056
You can find all the files in the GitHub project https://github.com/olddocks/facialkeypoints. You have to download the Kaggle CSV files as well.
prepare_train.py -> prepares and dumps the training data in FANN format
prepare_test.py -> dumps the test images for FANN to read
facial.c -> neural network trainer
ftest.c -> testing and predictions (produces results.txt)
kaggle.py -> produces kaggle.csv (results to upload to Kaggle)
To start the trainer, you first have to compile the C code:
gcc facial.c -o facial -lfann2 -lm -I /usr/local/include/fann
gcc ftest.c -o ftest -lfann2 -lm -I /usr/local/include/fann
After compiling, you run:
python prepare_data.py
python prepare_test.py
./facial
./ftest
python kaggle.py
Let's analyze the results for the first few test images. Each row below shows the 30 predicted values against the original values; the number in brackets is the difference between predicted and original, just to show how far off each prediction is.
Testing network.. 0 69.45/71.08(1.63) 38.68/39.58(0.91) 28.68/26.33(-2.35) 37.64/38.05(0.42) 59.64/59.26(-0.39) 36.75/36.07(-0.68) 74.55/73.94(-0.61) 35.51/34.62(-0.89) 36.81/37.48(0.67) 39.01/39.40(0.39) 21.39/22.06(0.67) 39.60/40.33(0.73) 54.88/53.30(-1.58) 29.95/29.99(0.04) 81.32/80.95(-0.37) 28.47/27.98(-0.49) 38.53/38.78(0.25) 32.26/33.30(1.04) 14.30/14.80(0.50) 33.95/35.98(2.04) 49.06/47.37(-1.69) 65.84/69.80(3.96) 70.50/72.60(2.10) 73.40/71.86(-1.54) 32.93/34.23(1.30) 77.32/77.34(0.02) 50.59/51.60(1.01) 75.66/75.96(0.31) 48.09/46.98(-1.11) 82.92/81.27(-1.64)
1 64.18/64.40(0.22) 36.78/38.50(1.73) 30.59/29.44(-1.15) 39.62/39.95(0.33) 59.07/59.26(0.19) 37.45/36.07(-1.38) 72.72/73.94(1.23) 36.10/34.62(-1.47) 37.28/37.48(0.20) 39.42/39.40(-0.02) 23.08/22.06(-1.02) 39.85/40.33(0.48) 54.15/53.30(-0.85) 31.38/29.99(-1.39) 78.95/80.95(1.99) 29.59/27.98(-1.61) 38.57/38.78(0.21) 33.13/33.30(0.17) 16.28/14.80(-1.49) 34.13/35.98(1.85) 46.06/46.74(0.68) 61.04/63.73(2.69) 68.53/72.60(4.07) 74.52/71.86(-2.65) 34.41/34.23(-0.18) 77.68/77.34(-0.34) 50.34/51.60(1.25) 75.32/75.96(0.64) 47.66/48.18(0.52) 76.92/75.27(-1.65)
2 68.75/69.26(0.51) 38.35/40.36(2.01) 29.91/28.49(-1.42) 35.56/36.19(0.63) 59.29/59.26(-0.03) 36.69/36.07(-0.63) 74.89/73.94(-0.95) 35.78/34.62(-1.16) 36.97/37.48(0.51) 38.88/39.40(0.52) 21.42/22.06(0.64) 39.23/40.33(1.10) 54.13/53.30(-0.83) 30.54/29.99(-0.55) 81.75/80.95(-0.80) 28.97/27.98(-0.99) 38.69/38.78(0.09) 32.46/33.30(0.85) 14.35/14.80(0.45) 34.06/35.98(1.92) 46.11/45.83(-0.29) 65.68/75.67(9.99) 69.40/72.60(3.20) 74.28/71.86(-2.41) 33.11/34.23(1.13) 77.40/77.34(-0.06) 49.99/51.60(1.61) 76.66/75.96(-0.70) 46.23/45.18(-1.04) 80.29/84.98(4.69) Mean Test Square Error: 0.000166
I uploaded these results to Kaggle and it gave me roughly a top-16 position with an error of 3.3. Wow! Not bad 🙂
P.S.: I have also played with this dataset using multiple hidden layers to try to squeeze out more accuracy, but the simple perceptron model performed best for this problem.
Deep Learning
A deep convolutional network is made up of multiple convolution and pooling layers (which act as feature extractors), followed by fully connected (inner product, IP) layers that connect to the output. The stacked convolution and pooling layers are also a way of reducing the dimensionality of the images. A good tutorial can be found here: http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/
We will see how Caffe performs at extracting the facial keypoints by running a deep network on a GPU. I used an Nvidia 750 Ti GPU with CUDA installed on an Ubuntu system.
Say the image dimensions are (M, N). A convolution layer with an (x, y) kernel (stride 1, no padding) produces feature maps of size (M-x+1, N-y+1), one per output channel. Pooling greatly reduces the spatial size: after pooling with a K x K kernel (stride K), the resulting size becomes (M/K, N/K) per output channel.
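As a quick sanity check (not part of the repo), these formulas can be traced in a few lines of Python, here with a hypothetical stack of 5x5 convolutions and 2x2 pooling; Caffe's actual output sizes also account for stride and padding.

def conv_out(m, k):
    # valid convolution, stride 1, no padding: output is M - k + 1
    return m - k + 1

def pool_out(m, k):
    # K x K pooling with stride K: output is roughly M / K
    return m // k

m = 96
m = conv_out(m, 5)   # 92
m = pool_out(m, 2)   # 46
m = conv_out(m, 5)   # 42
m = pool_out(m, 2)   # 21
print(m)             # spatial size per output channel after two conv+pool stages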
This is what we do:
1. Import the images from the Kaggle CSV, drop the NaNs, scale the inputs to the range 0-1 (divide the pixels by 255) and the output labels likewise (divide the coordinates by 96), equalize the histogram, and finally dump the NumPy arrays of inputs and labels to HDF5 files (see the sketch after this list). X is the input and y is the output label:
Train, test shapes (X, y): (1600, 1, 96, 96) (1600, 30) and (540, 1, 96, 96) (540, 30)
2. We write a layer file with convolution and pooling layers, ending in a EUCLIDEAN_LOSS layer over the 30 outputs, plus a solver file for Caffe.
3. We train with Caffe and finally predict the outputs by loading the data in chunks (batches of 64) into a MEMORY_DATA layer. This way the model can run on small PCs with as little as 2 GB of memory.
4. To make the model more accurate, we can also give it 15 extra features, such as Euclidean distances from the nose to the eyes, between the two eye centers, and so on.
5. We use RELU activations (which pass only values > 0 and converge faster) and a Dropout layer to prevent overfitting.
6. We can calculate distances between facial points and feed them as a feature vector to the model. We will save that for later; for now, let's see how our small model works.
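As mentioned in step 1, here is a minimal sketch of the HDF5 dump. It is not the repo's fkp.py: the helper name dump_hdf5, the use of scikit-image's histogram equalization, and the file names are my assumptions. What matters is that the dataset names match the tops of the HDF5_DATA layers ("data" and "label") and that hdf5_data_param.source points to a text file listing the .hd5 paths.

import h5py
import numpy as np
from skimage import exposure

def dump_hdf5(X, y, h5_path, list_path):
    # X: (N, 1, 96, 96) float32 in [0, 1], y: (N, 30) float32 in [0, 1]
    # histogram-equalize each grayscale image, as described in step 1
    X = np.array([[exposure.equalize_hist(img[0])] for img in X], dtype=np.float32)
    with h5py.File(h5_path, 'w') as f:
        # dataset names must match the tops of the HDF5_DATA layer
        f.create_dataset('data', data=X)
        f.create_dataset('label', data=y.astype(np.float32))
    # hdf5_data_param.source expects a text file listing the .hd5 files
    with open(list_path, 'w') as f:
        f.write(h5_path + '\n')

# stand-in arrays; in fkp.py these come from the Kaggle CSV
X_train = np.random.rand(1600, 1, 96, 96).astype(np.float32)
y_train = np.random.rand(1600, 30).astype(np.float32)
dump_hdf5(X_train, y_train, 'facialkp-train.hd5', 'train.txt')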
See my git project: https://github.com/olddocks/caffe-facialkp
This is how you run the project:
python fkp.py
./facialkp
python output.py
The following is a brief description of the files in the project and how to use them:
fkp.py -> writes and prepares all the data as HDF5
./facialkp -> runs the Caffe model
output.py -> predicts and plots graphs in simple batches of 64, and writes the predictions to CSV (see the sketch after this list)
solver.prototxt -> edit this for maximum iterations, gamma, learning rate, etc.
facialkp.prototxt -> layer file for training
facialkp_predict -> layer file for predictions
kaggle.py -> writes the Kaggle output to upload (you have to manually edit the CSV files to add the header labels, otherwise it will not work; sorry, I am a lazy coder :()
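For the prediction step, a bare-bones pycaffe forward pass looks roughly like the sketch below. This is not the repo's output.py (which batches data through a memory data layer); the deploy prototxt name, the snapshot file name, and the final blob name 'ip2' are assumptions based on the solver and layer files shown later.

import numpy as np
import caffe

caffe.set_mode_gpu()
# prediction net plus the weights snapshotted by the solver
net = caffe.Net('facialkp_predict.prototxt',
                '/home/pbu/Desktop/tmp_iter_1000.caffemodel',
                caffe.TEST)

# X_test: (N, 1, 96, 96) float32, preprocessed the same way as the training data
X_test = np.random.rand(64, 1, 96, 96).astype(np.float32)   # stand-in batch

net.blobs['data'].reshape(*X_test.shape)   # resize the input blob to this batch
net.blobs['data'].data[...] = X_test
net.forward()

pred = net.blobs['ip2'].data * 96.0        # undo the /96 label normalization
print(pred.shape)                          # (64, 30) keypoint coordinates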
Our 4D data dimensions
Input data: (1600, 1, 96, 96) -> channel 1 for grayscale images
Output labels: (1600, 30)
In our solver.prototxt we specify GPU or CPU mode:
net: "/home/pbu/Desktop/facialkp.prototxt"
test_iter: 6
test_interval: 100
base_lr: 0.01
lr_policy: "fixed"
weight_decay: 0.01
stepsize: 300
gamma: 0.1
display: 100
max_iter: 2000
momentum: 0.9
snapshot: 1000
snapshot_prefix: "/home/pbu/Desktop/tmp"
solver_mode: GPU
Update: Please note that I have discovered the IP and convolution layers can produce strange, unexpected results when their weights and biases are not initialized properly. You will need to include fillers like these in your layer file:
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
Our training layer file
name: "FKPReg"
layers {
  name: "fkp"
  type: HDF5_DATA
  top: "data"
  top: "label"
  hdf5_data_param { source: "train.txt" batch_size: 64 }
  include: { phase: TRAIN }
}
layers {
  name: "data"
  type: HDF5_DATA
  top: "data"
  top: "label"
  hdf5_data_param { source: "test.txt" batch_size: 100 }
  include: { phase: TEST }
}
layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 32
    kernel_size: 11
    stride: 2
    bias_filler { type: "constant" value: 0.1 }
  }
}
layers { name: "relu1" type: RELU bottom: "conv1" top: "conv1" }
layers {
  name: "pool1"
  type: POOLING
  bottom: "conv1"
  top: "pool1"
  pooling_param { pool: MAX kernel_size: 2 stride: 2 }
}
layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 64
    pad: 2
    kernel_size: 7
    group: 2
    bias_filler { type: "constant" value: 0.1 }
  }
}
layers { name: "relu2" type: RELU bottom: "conv2" top: "conv2" }
layers {
  name: "pool2"
  type: POOLING
  bottom: "conv2"
  top: "pool2"
  pooling_param { pool: MAX kernel_size: 2 stride: 2 }
}
layers {
  name: "norm2"
  type: LRN
  bottom: "pool2"
  top: "norm2"
  lrn_param { norm_region: WITHIN_CHANNEL local_size: 3 alpha: 5e-05 beta: 0.75 }
}
layers {
  name: "conv3"
  type: CONVOLUTION
  bottom: "norm2"
  top: "conv3"
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 5
    bias_filler { type: "constant" value: 0.1 }
  }
}
layers { name: "relu3" type: RELU bottom: "conv3" top: "conv3" }
layers {
  name: "conv4"
  type: CONVOLUTION
  bottom: "conv3"
  top: "conv4"
  convolution_param {
    num_output: 64
    pad: 1
    kernel_size: 5
    bias_filler { type: "constant" value: 0.1 }
  }
}
layers { name: "relu4" type: RELU bottom: "conv4" top: "conv4" }
layers {
  name: "conv5"
  type: CONVOLUTION
  bottom: "conv4"
  top: "conv5"
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 5
    bias_filler { type: "constant" value: 0.1 }
  }
}
layers { name: "relu5" type: RELU bottom: "conv5" top: "conv5" }
layers {
  name: "pool5"
  type: POOLING
  bottom: "conv5"
  top: "pool5"
  pooling_param { pool: MAX kernel_size: 4 stride: 2 }
}
layers {
  name: "drop0"
  type: DROPOUT
  bottom: "pool5"
  top: "pool5"
  dropout_param { dropout_ratio: 0.5 }
}
layers {
  name: "ip1"
  type: INNER_PRODUCT
  bottom: "pool5"
  top: "ip1"
  inner_product_param {
    num_output: 100
    bias_filler { type: "constant" value: 0.1 }
  }
}
layers { name: "relu6" type: RELU bottom: "ip1" top: "ip1" }
layers {
  name: "drop1"
  type: DROPOUT
  bottom: "ip1"
  top: "ip1"
  dropout_param { dropout_ratio: 0.5 }
}
layers {
  name: "ip2"
  type: INNER_PRODUCT
  bottom: "ip1"
  top: "ip2"
  inner_product_param {
    num_output: 30
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0.1 }
  }
}
layers { name: "relu22" type: RELU bottom: "ip2" top: "ip2" }
layers {
  name: "loss"
  type: EUCLIDEAN_LOSS
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
We run Caffe and see output like the following. We can clearly see how the convolution and pooling layers shape the data.
I0223 18:26:03.296244 3233 net.cpp:67] Creating Layer data
I0223 18:26:03.296267 3233 net.cpp:356] data -> data
I0223 18:26:03.296294 3233 net.cpp:356] data -> label
I0223 18:26:03.296316 3233 net.cpp:96] Setting up data
I0223 18:26:03.296334 3233 hdf5_data_layer.cpp:57] Loading filename from test.txt
I0223 18:26:03.296385 3233 hdf5_data_layer.cpp:69] Number of files: 1
I0223 18:26:03.296402 3233 hdf5_data_layer.cpp:29] Loading HDF5 filefacialkp-test.hd5
I0223 18:26:03.498864 3233 hdf5_data_layer.cpp:49] Successully loaded 540 rows
I0223 18:26:03.498987 3233 hdf5_data_layer.cpp:81] output data size: 100,1,96,96
I0223 18:26:03.499011 3233 net.cpp:103] Top shape: 100 1 96 96 (921600)
I0223 18:26:03.499032 3233 net.cpp:103] Top shape: 100 45 1 1 (4500)
I0223 18:26:03.499068 3233 net.cpp:67] Creating Layer conv1
I0223 18:26:03.499091 3233 net.cpp:394] conv1 <- data
I0223 18:26:03.499119 3233 net.cpp:356] conv1 -> conv1
I0223 18:26:03.499153 3233 net.cpp:96] Setting up conv1
I0223 18:26:03.499348 3233 net.cpp:103] Top shape: 100 64 44 44 (12390400)
I0223 18:26:03.499392 3233 net.cpp:67] Creating Layer relu1
I0223 18:26:03.499411 3233 net.cpp:394] relu1 <- conv1
I0223 18:26:03.499433 3233 net.cpp:345] relu1 -> conv1 (in-place)
I0223 18:26:03.499455 3233 net.cpp:96] Setting up relu1
I0223 18:26:03.499476 3233 net.cpp:103] Top shape: 100 64 44 44 (12390400)
I0223 18:26:03.499498 3233 net.cpp:67] Creating Layer pool1
I0223 18:26:03.499516 3233 net.cpp:394] pool1 <- conv1
I0223 18:26:03.499536 3233 net.cpp:356] pool1 -> pool1
I0223 18:26:03.499557 3233 net.cpp:96] Setting up pool1
I0223 18:26:03.499583 3233 net.cpp:103] Top shape: 100 64 22 22 (3097600)
I0223 18:26:03.499615 3233 net.cpp:67] Creating Layer conv2
I0223 18:26:03.499632 3233 net.cpp:394] conv2 <- pool1
I0223 18:26:03.499654 3233 net.cpp:356] conv2 -> conv2
I0223 18:26:03.499677 3233 net.cpp:96] Setting up conv2
I0223 18:26:03.500277 3233 net.cpp:103] Top shape: 100 64 22 22 (3097600)
I0223 18:26:03.500363 3233 net.cpp:67] Creating Layer relu2
I0223 18:26:03.500388 3233 net.cpp:394] relu2 <- conv2
I0223 18:26:03.500412 3233 net.cpp:345] relu2 -> conv2 (in-place)
I0223 18:26:03.500438 3233 net.cpp:96] Setting up relu2
I0223 18:26:03.500460 3233 net.cpp:103] Top shape: 100 64 22 22 (3097600)
I0223 18:26:03.500483 3233 net.cpp:67] Creating Layer pool2
I0223 18:26:03.500500 3233 net.cpp:394] pool2 <- conv2
I0223 18:26:03.500521 3233 net.cpp:356] pool2 -> pool2
I0223 18:26:03.500545 3233 net.cpp:96] Setting up pool2
I0223 18:26:03.500569 3233 net.cpp:103] Top shape: 100 64 11 11 (774400)
I0223 18:26:03.500591 3233 net.cpp:67] Creating Layer relu3
I0223 18:26:03.500607 3233 net.cpp:394] relu3 <- pool2
I0223 18:26:03.500627 3233 net.cpp:345] relu3 -> pool2 (in-place)
I0223 18:26:03.500648 3233 net.cpp:96] Setting up relu3
I0223 18:26:03.500669 3233 net.cpp:103] Top shape: 100 64 11 11 (774400)
I0223 18:26:03.500694 3233 net.cpp:67] Creating Layer conv3
I0223 18:26:03.500711 3233 net.cpp:394] conv3 <- pool2
I0223 18:26:03.500732 3233 net.cpp:356] conv3 -> conv3
I0223 18:26:03.500756 3233 net.cpp:96] Setting up conv3
I0223 18:26:03.501147 3233 net.cpp:103] Top shape: 100 128 11 11 (1548800)
I0223 18:26:03.501216 3233 net.cpp:67] Creating Layer pool3
I0223 18:26:03.501240 3233 net.cpp:394] pool3 <- conv3
I0223 18:26:03.501266 3233 net.cpp:356] pool3 -> pool3
I0223 18:26:03.501291 3233 net.cpp:96] Setting up pool3
I0223 18:26:03.501317 3233 net.cpp:103] Top shape: 100 128 6 6 (460800)
I0223 18:26:03.501343 3233 net.cpp:67] Creating Layer relu4
I0223 18:26:03.501359 3233 net.cpp:394] relu4 <- pool3
I0223 18:26:03.501379 3233 net.cpp:345] relu4 -> pool3 (in-place)
I0223 18:26:03.501399 3233 net.cpp:96] Setting up relu4
I0223 18:26:03.501421 3233 net.cpp:103] Top shape: 100 128 6 6 (460800)
I0223 18:26:03.501446 3233 net.cpp:67] Creating Layer conv4
I0223 18:26:03.501463 3233 net.cpp:394] conv4 <- pool3
I0223 18:26:03.501485 3233 net.cpp:356] conv4 -> conv4
I0223 18:26:03.501515 3233 net.cpp:96] Setting up conv4
I0223 18:26:03.501665 3233 net.cpp:103] Top shape: 100 100 8 8 (640000)
I0223 18:26:03.501701 3233 net.cpp:67] Creating Layer drop0
I0223 18:26:03.501719 3233 net.cpp:394] drop0 <- conv4
I0223 18:26:03.501739 3233 net.cpp:345] drop0 -> conv4 (in-place)
I0223 18:26:03.501761 3233 net.cpp:96] Setting up drop0
I0223 18:26:03.501780 3233 net.cpp:103] Top shape: 100 100 8 8 (640000)
I0223 18:26:03.501801 3233 net.cpp:67] Creating Layer ip1
I0223 18:26:03.501818 3233 net.cpp:394] ip1 <- conv4
I0223 18:26:03.501839 3233 net.cpp:356] ip1 -> ip1
I0223 18:26:03.501924 3233 net.cpp:96] Setting up ip1
I0223 18:26:03.505934 3233 net.cpp:103] Top shape: 100 100 1 1 (10000)
I0223 18:26:03.506044 3233 net.cpp:67] Creating Layer relu5
I0223 18:26:03.506064 3233 net.cpp:394] relu5 <- ip1
I0223 18:26:03.506088 3233 net.cpp:345] relu5 -> ip1 (in-place)
I0223 18:26:03.506110 3233 net.cpp:96] Setting up relu5
I0223 18:26:03.506141 3233 net.cpp:103] Top shape: 100 100 1 1 (10000)
I0223 18:26:03.506165 3233 net.cpp:67] Creating Layer drop1
I0223 18:26:03.506181 3233 net.cpp:394] drop1 <- ip1
I0223 18:26:03.506199 3233 net.cpp:345] drop1 -> ip1 (in-place)
I0223 18:26:03.506218 3233 net.cpp:96] Setting up drop1
I0223 18:26:03.506237 3233 net.cpp:103] Top shape: 100 100 1 1 (10000)
I0223 18:26:03.506265 3233 net.cpp:67] Creating Layer ip2
I0223 18:26:03.506281 3233 net.cpp:394] ip2 <- ip1
I0223 18:26:03.506302 3233 net.cpp:356] ip2 -> ip2
I0223 18:26:03.506325 3233 net.cpp:96] Setting up ip2
I0223 18:26:03.506366 3233 net.cpp:103] Top shape: 100 60 1 1 (6000)
I0223 18:26:03.506391 3233 net.cpp:67] Creating Layer relu6
I0223 18:26:03.506407 3233 net.cpp:394] relu6 <- ip2
I0223 18:26:03.506425 3233 net.cpp:345] relu6 -> ip2 (in-place)
I0223 18:26:03.506445 3233 net.cpp:96] Setting up relu6
I0223 18:26:03.506464 3233 net.cpp:103] Top shape: 100 60 1 1 (6000)
I0223 18:26:03.506484 3233 net.cpp:67] Creating Layer drop3
I0223 18:26:03.506500 3233 net.cpp:394] drop3 <- ip2
I0223 18:26:03.506520 3233 net.cpp:345] drop3 -> ip2 (in-place)
I0223 18:26:03.506538 3233 net.cpp:96] Setting up drop3
I0223 18:26:03.506554 3233 net.cpp:103] Top shape: 100 60 1 1 (6000)
I0223 18:26:03.506575 3233 net.cpp:67] Creating Layer ip3
I0223 18:26:03.506592 3233 net.cpp:394] ip3 <- ip2
I0223 18:26:03.506611 3233 net.cpp:345] ip3 -> ip2 (in-place)
I0223 18:26:03.506631 3233 net.cpp:96] Setting up ip3
I0223 18:26:03.506662 3233 net.cpp:103] Top shape: 100 45 1 1 (4500)
I0223 18:26:03.506687 3233 net.cpp:67] Creating Layer loss
I0223 18:26:03.506703 3233 net.cpp:394] loss <- ip2
I0223 18:26:03.506721 3233 net.cpp:394] loss <- label
I0223 18:26:03.506742 3233 net.cpp:356] loss -> loss
I0223 18:26:03.506762 3233 net.cpp:96] Setting up loss
I0223 18:26:03.506783 3233 net.cpp:103] Top shape: 1 1 1 1 (1)
I0223 18:26:03.506799 3233 net.cpp:109] with loss weight 1
Finally, we look at the train and test error, which works out to roughly 0.029 x 96 = 2.7 in pixel terms. That is still fairly high; the network is converging slowly and seems to learn only a little.
I0223 19:10:08.230324 3702 solver.cpp:342] Snapshotting solver state to /home/pbu/Desktop/tmp_iter_1000.solverstate
I0223 19:10:08.234257 3702 solver.cpp:264] Iteration 1000, Testing net (#0)
I0223 19:10:08.384732 3702 solver.cpp:315] Test net output #0: loss = 0.0294809 (* 1 = 0.0294809 loss)
It shows a test error of 0.029. We multiply by 96 (because we normalized the labels) and get an error of about 2.7. Wow, better than the simple FANN model's Kaggle score of 3.3. But wait, is it overfitting? You will know when you validate the answers with Kaggle.
Let's take a look at a couple of predicted outputs from the Kaggle test set, images 0-64.
Let's take a look at set 100-164 (a bit of orientation and tilt).
Let's take a look at test set 200-264: it gets more complex, with more errors in the predictions.
Now the more complicated images in set 400-644, flipped right and left; our model is not doing well with respect to the orientation and angles of the faces.
I uploaded the results to Kaggle and got a final prediction error of 4.6. So our simple FANN model, with its 3.3, blows away the deep learning model, even though the latter's local test error was only 2.7.
Why is our Caffe model performing so poorly? One reason is that we dropped about 70% of the training data because of the NaNs. If we use all the data by filling the missing values with the mean, maybe the error will improve.
So far we have used the dataset with all NaNs dropped. Let's use 100% of the dataset and see if the model improves. We fill the missing values with the column means and train the model. You can make the changes in fkp.py.
Train (X,y): (6000, 1, 96, 96) (6000, 30)
Test (X,y) : (1049, 1, 96, 96) (1049, 30)
After 1000 iterations, with learning rate of 0.01
I0224 19:10:10.937978 3639 solver.cpp:246] Iteration 1000, loss = 0.0102475
I0224 19:10:10.938068 3639 solver.cpp:264] Iteration 1000, Testing net (#0)
I0224 19:10:11.084893 3639 solver.cpp:315] Test net output #0: loss = 0.00789451 (* 1 = 0.00789451 loss)
After 1000 iterations, the test error is 0.00789, which multiplied by 96 gives about 0.75, a very small error compared to the 2.7 we got with the smaller dataset. But let's not be fooled by the model. Let's plot the x, y predictions and see if they are actually accurate.
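For reference, overlaying one prediction on its image takes only a few lines of matplotlib (a sketch; the plotting in output.py may differ). It assumes img is a (96, 96) array and pred is one row of 30 values already scaled back to pixel coordinates, ordered x1, y1, x2, y2, ...

import numpy as np
import matplotlib.pyplot as plt

img = np.random.rand(96, 96)          # stand-in for a test image
pred = np.random.rand(30) * 96        # stand-in for one row of predictions

plt.imshow(img, cmap='gray')
plt.scatter(pred[0::2], pred[1::2], c='r', marker='x')   # x coords, y coords
plt.show()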
Looking at the plots, the model's predictions are far worse than the previous ones, despite the lower test error.
From my observations, the most accurate model was the one trained on the dataset with the NaNs dropped, running 500 iterations at a learning rate of 0.01. Training too fast makes the model converge to the mean, which means it is just memorizing the average keypoint positions. The best Kaggle result I could squeeze out of the Caffe model was 4.4.
All the code files can be accessed at
Facial keypoints Fann library: https://github.com/olddocks/facialkeypoints
Deep learning in Caffe: https://github.com/olddocks/caffe-facialkp
A much better approach to facial keypoint extraction, with much higher accuracy, is documented by Daniel Nouri: http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
Caffe has a public Google group for discussions. You can ask questions there, as Caffe lacks full documentation on regression and other topics: https://groups.google.com/forum/#!forum/caffe-users