Python Data Analysis With Sqlite And Pandas
As a step in learning Data Analysis With Python I wanted to
- set up a database
- write values to it (to fake statistics from production)
- read values from the database into pandas
- do some filtering with pandas
- make a plot with matplotlib.
So this text describes these steps.
Set up the environment
To spice things up I wanted this to run on a Raspberry Pi (see Dagbok 20151215). I started with the Raspbian Lite image from the official Raspberry Pi downloads page (see [1]).
This was a fun but painfully slow way to set up the environment. Had I known this, I should probably have spent twice as much on the micro-SD card to get a faster one. At first I also used Wi-Fi instead of a wired Ethernet connection.
After running sudo raspi-config to make use of the entire storage, I updated the system and installed my favorite desktop environment (Xfce), a nice editor (GNU Emacs) and the Python packages I needed:
    sudo apt-get update
    sudo apt-get upgrade
    sudo apt-get install emacs-nox
    sudo apt-get install xfce4 xfce4-goodies xfce4-screenshooter
    sudo apt-get install sqlite3
    sudo apt-get install python-scipy
    sudo apt-get install python-pandas
    sudo apt-get install ...
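Not part of the original setup, but a quick sanity check that the packages actually import could look like this (the version strings will of course differ):

    python -c "import pandas; print pandas.__version__"
    python -c "import scipy; print scipy.__version__"
    python -c "import sqlite3; print sqlite3.sqlite_version"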
Set up the database
I wanted to use some form of SQL database, and SQLite is perfect for the job. Since I want to do this programmatically I go through Python. In this short example I connect to a (new) database and create a table called sensor:
    import sqlite3

    FILENAME = '/tmp/database.data'  # the database file, created if missing

    conn = sqlite3.connect(FILENAME)
    cur = conn.cursor()
    sql = """
    create table sensor (
        sid integer primary key not null,
        name text,
        notes text
    );"""
    _ = cur.execute(sql)
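The full example creates a few more tables in the same way. I have not reproduced its code here; inferred from the queries further down, and with guessed column types, they look roughly like this:

    # a sketch of the remaining tables, column types are my guesses
    _ = cur.executescript("""
    create table line (
        lid integer primary key not null,
        name text,
        notes text
    );
    create table product (
        pid integer primary key not null,
        name text
    );
    create table reading (
        rid integer primary key not null,
        timestamp text,
        serial text,
        sid integer,
        pid integer,
        lid integer,
        verdict text
    );""")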
I fill this and the other tables with some values. In fact I do this in a very complicated way, just for fun, and it turned out to be very, very slow. If you feel like getting the details, scroll down and read the code in the full example.
sql = "insert into sensor(sid, name, notes) values (%d, '%s', '%s');" for (uid, name, notes) in [(201, 'Alpha', 'Sensor for weight'), \ (202, 'Beta', 'Sensor for conductivity'), (203, 'Gamma', 'Sensor for surface oxides'), (204, 'Delta', 'Sensor for length'), (205, 'Epsilon', 'Sensor for x-ray'), (206, 'Zeta', 'Color checker 9000'), (207, 'Eta', 'Ultra-violet detector'), ]: cur.execute(sql % (uid, name, notes))
The full example builds this table and a few others, and adds some 700 thousand fake sensor readings to the database. On my Raspberry Pi 2 this takes almost 6 minutes, but that's OK since it is intended to fake 7 years of sensor readings:
    $ time python build.py
    create new file /tmp/database.data
    insert values from line 1
    [...]
    real    5m42.281s
    user    5m6.020s
    sys     0m11.460s
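Much of the slowness is self-inflicted. A faster pattern for bulk inserts is executemany with a single commit at the end; here is a sketch with made-up values (the serial format, product ids and constant timestamp are my assumptions, not what build.py does):

    import random

    sql = ("insert into reading(rid, timestamp, serial, sid, pid, lid, verdict)"
           " values (?, ?, ?, ?, ?, ?, ?);")
    # hypothetical fake data; the real build.py spreads 7 years of readings
    rows = ((i,
             '2015-12-15 12:00:00',            # a real script would vary this
             'SN%06d' % i,                     # made-up serial number format
             random.choice([201, 202, 203]),   # sid, from the sensor table
             random.choice([1, 2, 3]),         # pid, assumed product ids
             random.choice([101, 102, 103]),   # lid, from the line table
             random.choice(['OK', 'FAIL', 'No read']))
            for i in xrange(700000))
    cur.executemany(sql, rows)
    conn.commit()  # one commit for all rows beats one commit per row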
Read values from the database
We want to read the values back, and I experimented with default settings for the sqlite3 command line tool in my .sqliterc file. I ended up with this:
    $ cat ~/.sqliterc
    .mode column
    .headers on
Anyway, I first try some database queries with the command line tool. If you have never used it before, I can only urge you to learn hand-crafting SQL queries. It really speeds up debugging and experimentation to have a command line session running in parallel with the code being written. Here is a typical small session:
    $ sqlite3 database.data
    sqlite> select * from line;
    lid         name        notes
    ----------  ----------  -----------------------------
    101         L1          refurbished soviet line, 1956
    102         L2          multi-purpose line from 1999
    103         L3          mil-spec line, primary
    104         L4          mil-spec line, secondary
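For example, counting readings per line is a natural next query. Take this session as a reconstruction: the totals are simply the per-line sums of the pandas groupby output shown further down:

    sqlite> select line.name, count(*) as total
       ...> from reading, line
       ...> where reading.lid = line.lid
       ...> group by line.name;
    name        total
    ----------  ----------
    L1          183364
    L2          479589
    L3          21282
    L4          1389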
As we saw above when we created the values, communicating through Python is super-easy, so now we want these values to go into pandas for data analysis. As it turns out, this was also very easy once you figure out how: the tricky part was discovering that the command I needed was pandas.read_sql(query, conn). This example works fine in IPython (see Ipython First Look), where you get syntax completion, but it also works in a regular Python session, or as a script:
    import pandas
    import matplotlib.pyplot as plt
    import sqlite3

    conn = sqlite3.connect('./database.data')
    limit = 1000000
    query = """
    select reading.rid, reading.timestamp,
           product.name as product, serial,
           line.name as line, sensor.name as sensor, verdict
    from reading, product, line, sensor
    where reading.sid = sensor.sid
      and reading.pid = product.pid
      and reading.lid = line.lid
    limit %s;
    """ % limit
    data = pandas.read_sql(query, conn)
We now have very many values in the data structure called data. My poor Raspberry Pi leaped from 225 MB of used memory to 465 MB, after peaking at more than 500 MB. Remember that this little computer only has about 925 MB available after giving some of it to the GPU.
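Had memory run out, pandas can stream the result instead: read_sql takes a chunksize argument and then returns an iterator over DataFrames. A sketch:

    # stream the query result in chunks of 100000 rows
    total = 0
    for chunk in pandas.read_sql(query, conn, chunksize=100000):
        total += len(chunk)  # do the real per-chunk work here
    print total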
Let's take a look at the data by counting the values based on what line and product they represent:
    print data.groupby(['line', 'product']).size()

    line  product
    L1    PA         183364
    L2    PA          47247
          PB          57258
          PC         375084
    L3    PB           7971
          PC          13311
    L4    PD           1389
For someone who has not studied my toy example this means that, for example, on line L3 we have recorded 7971 sensor readings on products of type PB and 13311 readings on products of type PC. These values are of course totally irrelevant, but imagine them being real values from a real Raspberry Pi project in a production site where you are responsible for improving the quality of the physical entities being produced. Then these values might mean that line L4 is not living up to expectations and could be scrapped, or that product PB on line L3 should instead be produced on line L4.
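This is also where the filtering from my list at the top comes in. For example, picking out only the failed readings on line L3 (assuming 'FAIL' is one of the verdict values, as in the plot below) is a one-liner:

    # boolean indexing: rows where both conditions hold
    l3_fails = data[(data['line'] == 'L3') & (data['verdict'] == 'FAIL')]
    print l3_fails.groupby('product').size()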
Make a plot
I made a bar chart. I am not too happy with this example; I think the code is too verbose and bulky for a minimal example. Perhaps you can make it prettier. It still nicely illustrates the power of matplotlib.
    fig, ax = plt.subplots()
    new_set = data.groupby('verdict').size()
    width = 0.8
    ax.bar([1, 2, 3], new_set, width=width)
    plt.ylabel('Total')
    plt.title('Number of sensor readings per outcome')
    plt.xticks([1 + width/2, 2 + width/2, 3 + width/2],
               ('OK', 'FAIL', 'No read'))
    plt.tight_layout()
    plt.savefig('python-data-analysis-sqlite-pandas.png')
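A more compact variant leans on pandas' built-in plotting. Note that the bar labels then come straight from the verdict values in the data, so the order may differ from my hand-made xticks; the output filename here is just an alternative I made up:

    counts = data.groupby('verdict').size()
    ax = counts.plot(kind='bar')  # pandas draws the bars and labels for us
    ax.set_ylabel('Total')
    ax.set_title('Number of sensor readings per outcome')
    plt.tight_layout()
    plt.savefig('python-data-analysis-sqlite-pandas-alt.png')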
And here is the plot, as created on a Raspberry Pi:
Summary
Data Analysis With Python is extremely powerful and can be done, with some pain, even on a Raspberry Pi. Download the full example from here: [2]
My next step is to pretend that the database solution does not scale to the new needs (all the new lines), so I need a front-end for presenting sensor readings and manually commenting on bad verdicts. We treat this database as a legacy database and use Django: Python Data Presentation With Django And Legacy Sqlite Database
This page belongs in Kategori Programmering
See also Data Analysis With Python