Let the syntax do the talking

How to run scikit knn demo from

On 2015-02-20 I gave a presentation at the Silicon Valley Machine Learning Meetup.

The presentation explained how to operate some of the Machine Learning software I wrote to generate predictions served by:

The page you see here summarizes that presentation.

I run this software on Ubuntu 14 and I suggest you get a copy of that running on your laptop.

This software might run on a Mac.

On Ubuntu I create an account called 'ann' like this:
useradd -m -s /bin/bash ann
passwd ann
Then I log in to the ann account:
ssh -YA ann@localhost
Next I visit:

I download this script:

Then I run the Anaconda script:
I should see something like this:
ann@feb ~ $ 
ann@feb ~ $ 
ann@feb ~ $ 

snip ...

Do you approve the license terms? [yes|no]
[no] >>> yes

Anaconda will now be installed into this location:

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify an different location below

[/home/ann/anaconda] >>> 
installing: python-2.7.8-1 ...
installing: conda-3.7.0-py27_0 ...
installing: conda-build-1.8.2-py27_0 ...


installing: zlib-1.2.7-0 ...
installing: anaconda-2.1.0-np19py27_0 ...
installing: _cache-0.0-x0 ...
Python 2.7.8 :: Continuum Analytics, Inc.
creating default environment...
installation finished.
Do you wish the installer to prepend the Anaconda install location
to PATH in your /home/ann/.bashrc ? [yes|no]
[no] >>> 

You may wish to edit your .bashrc or prepend the Anaconda install location:

$ export PATH=/home/ann/anaconda/bin:$PATH

Thank you for installing Anaconda!
ann@feb ~ $ bash
ann@feb ~ $ which python
ann@feb ~ $ python
Python 2.7.8 |Anaconda 2.1.0 (64-bit)| (default, Aug 21 2014, 18:22:21) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: and
ann@feb ~ $ 
ann@feb ~ $ 
ann@feb ~ $ 

ann@feb ~ $ 
ann@feb ~ $ cd anaconda/bin
ann@feb ~/anaconda/bin $ 
ann@feb ~/anaconda/bin $ ls -la curl
-rwxr-xr-x 2 ann ann 171632 Sep 11 19:22 curl
ann@feb ~/anaconda/bin $ 
ann@feb ~/anaconda/bin $ mv curl curl_ann
ann@feb ~/anaconda/bin $ 
ann@feb ~/anaconda/bin $ 

Next, I visit this URL:

I copy/paste the python script into a file called and then I inspect the file using cat:
cat ~ann/
# ~/

# This script should look at eur_usd_00.csv and issue predictions.

# I should have Anaconda installed.

# I get Anaconda here:
# I want this from there:
# Install is easy:
# bash

# Ref:
# wget

# Demo:
# cd ~
# ls -la eur_usd_00.csv
# ls -la
# python

import pdb
import datetime
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
import matplotlib
import matplotlib.pyplot as plt

pcount = 100
pcount = 32727
print('I should generate this many predictions:')

# I need lagn()
def lagn(myn,myl):
  myfoot = [myl[0],] * myn
  return((myfoot + myl)[0:len(myl)])

# I need leadn()
def leadn(myn,myl):
  myhead = [myl[len(myl)-1],] * myn
  return((myl + myhead)[myn:(len(myl)+myn)])

# As I follow predictions I should collect my gains:
my1g   = {}
myprob = {}

# I should predict these pairs:
# for pair in ['aud_usd','eur_usd','gbp_usd','nzd_usd','usd_cad','usd_jpy']:
for pair in ['eur_usd']:
  # I should predict 1hr gain starting at these minutes past the hour:
  #for minn in ['00','05','10','15','20','25','30','35','40','45','50','55']:
  for minn in ['00']:
    # I should read data from a file like this:
    # ~/eur_usd_00.csv
    csvfile = '/home/ann/eur_usd_00.csv'
    mydtype = [('pair','S7'),('ydate','S19'),('cp','f8')]
    fxrows  = np.loadtxt(csvfile, dtype=mydtype, delimiter=',')
    print('We have this many observations:')
    if pcount > len(fxrows):
      print('You asked for too many predictions.')
      print('We only have this many observations:')
      print('prediction count + learning observations')
      print('should be less than all observations.')
    print('My 1st prediction should start here:')
    print('My last prediction should end here:')
    print('I am busy now, please wait...')
    # I should get a column of datetimes from the CSV data:
    mydt = [datetime.datetime.strptime(row[1], "%Y-%m-%d %H:%M:%S") for row in fxrows ]
    # Later, I should use mydt to help me plot prices and predictions.
    my1g[pair]   = {}
    myprob[pair] = {}
    cp     = [row[2] for row in fxrows]
    cplead = np.array(leadn(1,cp))
    cplag1 = np.array(lagn( 1,cp))
    cplag2 = np.array(lagn( 2,cp))
    cplag3 = np.array(lagn( 3,cp))
    cplag4 = np.array(lagn( 4,cp))
    cplag5 = np.array(lagn( 5,cp))
    g1     = cp - cplag1
    g2     = cp - cplag2
    g3     = cp - cplag3
    g4     = cp - cplag4
    g5     = cp - cplag5
    gg     = cplead - cp
    allx = np.zeros( (len(g1), 5) )
    allx[:,0] = g1
    allx[:,1] = g2
    allx[:,2] = g3
    allx[:,3] = g4
    allx[:,4] = g5
    ally      = gg > 0
    pend      = len(ally)
    pstart    = pend - pcount
    # I should ensure that is_oos_gap exceeds the lead used for cplead (1) to avoid leaking oos-data into is-data:
    is_oos_gap = 2
    # for is_rowcount in [1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,6000]:
    for is_rowcount in [1000]:
      if pcount + is_rowcount >= pend:
        pstart = is_rowcount + 11
        pcount = pend - pstart
        print('You asked for too many predictions.')
        print('I dont have enough observations to support that many.')
        print('I will give you this many predictions:')
      # Initialize my1g, myg, myp:
      my1g[pair][is_rowcount]   = [-1]  * len(ally)
      myprob[pair][is_rowcount] = [-1]  * len(ally)
      myg = my1g[pair][is_rowcount]
      myp = myprob[pair][is_rowcount]
      # I should use cp2 to store cumulative gain.
      cp2 = [-1]  * len(ally)
      cp2[pstart] = cp[pstart]
      # Now I should be setup to generate pcount predictions:
      for oos_start in range(pstart,pend):
        # Count backwards from oos_start
        is_start = oos_start - is_rowcount - is_oos_gap
        is_end   = oos_start - is_oos_gap
        myx1     = allx[is_start:is_end,:]
        myy      = ally[is_start:is_end]
        knn1 = KNeighborsClassifier(n_neighbors=len(myx1), weights='distance')
        # I should learn from the past:
        knn1.fit(myx1,myy)
        # I should predict the future:
        # Newer scikit-learn versions expect a 2-D array with one row per sample:
        x_oos  = allx[oos_start,:].reshape(1,-1)
        upprob = knn1.predict_proba(x_oos)[0,1]
        myp[oos_start] = upprob
        # I should be bullish if upprob > 0.5 else I should be bearish.
        # I should track cumulative gain as I respond to predictions:
        myg[oos_start] = np.sign(upprob - 0.5) * gg[oos_start]
        if (upprob >= 0.5):
          # Bullish: I gain whatever the price gains.
          myg[oos_start] = gg[oos_start]
          if oos_start+1 < pend:
            cp2[oos_start+1] = cp2[oos_start] + gg[oos_start]
        else:
          # Bearish: I gain the opposite of what the price gains.
          myg[oos_start] = -gg[oos_start]
          if oos_start+1 < pend:
            cp2[oos_start+1] = cp2[oos_start] - gg[oos_start]
      print('I just finished.')
      print('For each prediction, I learned from this many observations:')
      print('Results of "Open-long and hold":')
      print('Results of: "Following the predictions":')
      print('I should show final positions:')
      print('Open-long and hold:')
      print('Follow the predictions:')
      # I should plot cp and cp2.
      # cp  should be blue.  cp is open-long and hold.
      # cp2 should be green. cp2 is follow predictions.
      plt.plot(mydt[pstart:pend], cp[pstart:pend], 'b-', mydt[pstart:pend], cp2[pstart:pend], 'g-')
      print('Look here: ')
# bye
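The lagn() and leadn() helpers in the script are easier to follow with a tiny example. The sketch below copies the two helpers and applies them to four made-up prices (the numbers are hypothetical, not from eur_usd_00.csv) to show how one feature column (g1) and the label column (ally) come out of them. It uses plain lists instead of numpy so it runs anywhere.

```python
def lagn(myn, myl):
    # Shift the list right by myn, padding the front with the first value.
    myfoot = [myl[0]] * myn
    return (myfoot + myl)[0:len(myl)]

def leadn(myn, myl):
    # Shift the list left by myn, padding the end with the last value.
    myhead = [myl[-1]] * myn
    return (myl + myhead)[myn:len(myl) + myn]

cp = [10, 11, 13, 12]        # hypothetical closing prices

cplag1 = lagn(1, cp)         # [10, 10, 11, 13]
cplead = leadn(1, cp)        # [11, 13, 12, 12]

# g1 is the gain over the previous observation (a feature).
g1 = [now - before for now, before in zip(cp, cplag1)]   # [0, 1, 2, -1]

# gg is the gain over the next observation (what I want to predict).
gg = [nxt - now for nxt, now in zip(cplead, cp)]         # [1, 2, -1, 0]

# ally is the label: True when the next observation is higher.
ally = [g > 0 for g in gg]   # [True, True, False, False]

print(g1)
print(gg)
print(ally)
```

Because leadn() pads the end with the last price, the last value of gg is always 0 and the last label is always False. That final row carries no real information about the future.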
Next, I get the file full of eur_usd prices:
I verify its location:
head ~ann/eur_usd_00.csv
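The script reads the file with three comma-separated columns (pair, ydate, cp) and no header row, so each line should look like the two samples below. Here is a small sketch I can use to check that a line parses the way the script expects. The two sample lines are taken from the script output shown on this page; parse_rows() is a helper name I made up for this illustration.

```python
import csv
import datetime

# Two rows copied from the output of the demo script;
# the real file has many more rows.
sample_lines = [
    "eur_usd,2009-07-06 10:00:00,1.3904",
    "eur_usd,2014-11-28 22:00:00,1.2451",
]

def parse_rows(lines):
    """Parse 'pair,ydate,cp' lines into (pair, datetime, float) tuples."""
    rows = []
    for pair, ydate, cp in csv.reader(lines):
        when = datetime.datetime.strptime(ydate, "%Y-%m-%d %H:%M:%S")
        rows.append((pair, when, float(cp)))
    return rows

rows = parse_rows(sample_lines)
for row in rows:
    print(row)
```

If a line has a different number of columns or a different date format, parse_rows() raises a ValueError, which tells me the file is not in the shape the script needs.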
Now I can use Python to learn from those prices and then issue predictions:
ann@feb ~ $ 
ann@feb ~ $ 
ann@feb ~ $ anaconda/bin/python
I should generate this many predictions:
We have this many observations:
My 1st prediction should start here:
('eur_usd', '2009-07-06 10:00:00', 1.3904)
My last prediction should end here:
('eur_usd', '2014-11-28 22:00:00', 1.2451)
I am busy now, please wait...
I just finished.
For each prediction, I learned from this many observations:
Results of "Open-long and hold":
Results of: "Following the predictions":
I should show final positions:
Open-long and hold:
Follow the predictions:
Look here: 
[0.48855563140030833, 0.48785146650572808, 0.4849981226331404, 0.46710512413827809]
ann@feb ~ $ 
ann@feb ~ $ 
ann@feb ~ $ 
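The slow part of that run is the innermost loop: for every prediction, the script refits KNN on a window of past rows that slides forward by one observation. The sketch below only reproduces the index arithmetic from the script, so I can see which rows feed each prediction. training_window() is a name I made up for this illustration; the values 1000 and 2 match is_rowcount and is_oos_gap in the script, and the oos_start values are arbitrary.

```python
def training_window(oos_start, is_rowcount, is_oos_gap):
    """Return (is_start, is_end) slice bounds for the in-sample rows."""
    is_end = oos_start - is_oos_gap
    is_start = is_end - is_rowcount
    return is_start, is_end

for oos_start in range(1500, 1503):
    is_start, is_end = training_window(oos_start, is_rowcount=1000, is_oos_gap=2)
    # The slice allx[is_start:is_end] holds is_end - is_start rows.
    print(oos_start, is_start, is_end, is_end - is_start)
```

The last in-sample row is at index is_end - 1, which is oos_start - 3. Its label looks one observation ahead, at index oos_start - 2, so no price at or after oos_start reaches the training data.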
Next I inspect the png file created by the script:

Questions for the students:

Describe the use-case(s) behind this script: who would want to run it and why?

This script generated 32,727 predictions. What is an optimal way to visualize those predictions?

This script has four nested for-loops. Why?

What are the objects that we iterate over in the for-loops?

What in the data is the machine learning from?

How do I calculate the accuracy of this script?

How do I calculate the effectiveness of this script?

How do I determine if more observations give me better accuracy/effectiveness?

This script is dealing with a subset of Forex data. What can I infer about the full set?

How would I determine if the Machine Learning algorithm used here is better than others?
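
For the accuracy and effectiveness questions above, here is one possible starting point, not the only answer. It assumes I have a list of up-probabilities (like myp in the script) and a list of actual gains (like gg), and it treats a probability of 0.5 or more as bullish, the same way the script does. The function names and the four sample values are made up for this illustration.

```python
def accuracy(upprobs, gains):
    """Fraction of predictions whose direction matched the actual gain."""
    hits = sum((p >= 0.5) == (g > 0) for p, g in zip(upprobs, gains))
    return hits / float(len(gains))

def followed_gain(upprobs, gains):
    """Total gain if I go long when bullish and short when bearish."""
    total = 0
    for p, g in zip(upprobs, gains):
        total += g if p >= 0.5 else -g
    return total

# Hypothetical values, not taken from eur_usd_00.csv:
upprobs = [0.6, 0.4, 0.7, 0.3]
gains = [2, -1, -3, 4]

print(accuracy(upprobs, gains))        # share of correct directions
print(followed_gain(upprobs, gains))   # "Following the predictions"
print(sum(gains))                      # "Open-long and hold"
```

Accuracy counts how often the direction was right. Effectiveness compares the gain from following the predictions with the gain from simply holding, and the two measures can disagree when a few large moves dominate.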