Data Analysis With Python
Background
I have a challenge at work: we have lots of multi-dimensional data that I want to structure and visualize. We already create some plots and look at some trends, but I would like to get better at analyzing, structuring and visualizing it.
There are of course many ways to do this. A lot of people use Microsoft Excel. But I have millions of data entries so using that kind of a GUI is not an option for me. Others use the R programming language (W: [1]), or Matlab (W: [2]), Mathematica (W: [3]), Maple (W: [4]) or other solutions (W: [5]). These tools might be all right and perfect for the job -- but I have been working with Python (W: [6]) for more than 10 years and I want to dive into the tools available within the Python ecosystem first.
One goal of this page in my blog is to have a starting point for learning these tools better. Right now I am pretty good at using matplotlib, but I want to learn more about the other tools - so I'll try to store my findings so that I can find them here.
Another, perhaps even more important goal, is to learn about data analysis. I studied data mining a bit at the university and I want to revisit the topic and others.
Data Analysis Sub-Processes and Topics
According to Wikipedia (W: [7]), data analysis follows a process. It starts with collecting raw data from the world and processing it into a clean dataset. On this dataset one can perform exploratory data analysis. From the clean dataset and the exploratory data analysis we can build models and algorithms. The results are then communicated, visualized and reported, which supports making decisions. Before data goes back into the world it can also pass through a data product -- this could for example be an e-book store that recommends books based on what you (and others) have previously bought.
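The steps of this process can be sketched in a few lines. This is only a minimal illustration, assuming pandas and NumPy are installed; the column names and numbers are made up, and a straight-line fit stands in for the "models and algorithms" step.

```python
import numpy as np
import pandas as pd

# 1. Collect raw data (hypothetical readings, made up for illustration).
raw = pd.DataFrame({
    "temperature": [20.0, 21.5, None, 23.0, 24.5, 26.0],
    "sales": [100, 110, 115, 130, None, 160],
})

# 2. Process and clean: drop rows that have a missing value.
clean = raw.dropna().reset_index(drop=True)

# 3. Exploratory data analysis: summary statistics and correlation.
summary = clean.describe()
correlation = clean["temperature"].corr(clean["sales"])

# 4. Model: fit a straight line, sales = slope * temperature + intercept.
slope, intercept = np.polyfit(clean["temperature"], clean["sales"], deg=1)

# 5. Communicate: a one-line report (in practice a plot would go here).
report = f"{len(clean)} clean rows, correlation {correlation:.2f}, slope {slope:.1f}"
print(report)
```

In real work each step is of course much larger, but the shape stays the same: raw data in, clean data, exploration, a model, and something to show to others.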
Other topics that are included in Data Analysis include:
- Business Intelligence: "...covers data analysis that relies heavily on aggregation, focusing on business information."
- Data Mining: "...a particular data analysis technique that focuses on modeling and knowledge discovery for predictive [...] purposes..."
- Text Mining: "applies statistical, linguistic, and structural techniques to extract and classify information from textual sources..."
- Terminology related to Statistics:
  - descriptive statistics
  - exploratory data analysis
  - confirmatory data analysis
- Predictive analytics: "focuses on application of statistical models for predictive forecasting or classification..."
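To make one of the statistics terms above concrete, here is a minimal sketch of descriptive statistics using only the Python standard library. The values are made up for illustration; with real data, pandas or NumPy would probably be the more natural choice.

```python
import statistics

# Hypothetical measurements, made up for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)       # arithmetic average
median = statistics.median(values)   # middle value of the sorted data
spread = statistics.pstdev(values)   # population standard deviation

print(mean, median, spread)
```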
Commonly mentioned tools
I have looked at the summaries of many books on data analysis and Python, in particular from O'Reilly (see for example [8] or [9]) and Packt (see for example [10], [11], [12], [13], [14], [15], or the book Python Data Analysis that I bought and reviewed). (This might look like advertising, but it is not. I have not read any of these books but they seem pretty good. I mention them because they in turn reference interesting parts of the Python ecosystem.)
All of these books discuss either data analysis or commonly used Python tools for data analysis. The tools are:
- Python, the programming language -- (W: [16], H: [17])
- SciPy, "... is an open source Python library used by scientists, analysts, and engineers doing scientific computing and technical computing... SciPy builds on the NumPy array object and is part of the NumPy stack which includes tools like Matplotlib, pandas and SymPy...". The core projects of the SciPy stack are NumPy, Matplotlib, IPython, SymPy and pandas, so this entire page overlaps pretty well with SciPy. -- (W: [18], H: [19], SciPy Lecture Notes: [20])
- NumPy, "the fundamental package for scientific computing with Python" -- (W: [21], H: [22]),
- Matplotlib, "...a plotting library for the Python programming language and its numerical mathematics extension NumPy..." -- (W: [23], H: [24])
- pandas, "...a software library written for the Python programming language for data manipulation and analysis..." -- (W: [25], H: [26])
- SymPy, "...a Python library for symbolic computation..." -- (W: [27], H: [28])
- networkx, "...a Python library for studying graphs and networks...." -- (W: [29], H: [30])
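Since most of the tools above build on the NumPy array object, a minimal sketch of what that object offers may be useful. It assumes NumPy is installed; the numbers are arbitrary.

```python
import numpy as np

# A vector and a 2x3 matrix as NumPy arrays (arbitrary example values).
vector = np.array([1.0, 2.0, 3.0])
matrix = np.array([[1.0, 0.0, 2.0],
                   [0.0, 3.0, 0.0]])

scaled = vector * 2.0                # elementwise arithmetic, no explicit loop
product = matrix @ vector            # matrix-vector product, shape (2,)
column_means = matrix.mean(axis=0)   # one mean per column, shape (3,)

print(scaled, product, column_means)
```

The point is that operations apply to whole arrays at once, which is what makes the rest of the stack practical for millions of entries.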
Often these are combined with a way to organize Python sessions. Instead of a sloppy Python command-line session we use a notebook metaphor. Tools that are sometimes mentioned are:
- IPython, "...is a command shell for interactive computing..." -- (W: [31], H: [32])
- IDLE, "...is an integrated development environment for Python, which has been bundled... since 1.5.2b1..." -- (W: [33], H: [34])
Data, Input & Output
I want to explore these tools by working with data. This data comes in many forms and I typically read from one format and export to another. Sometimes the input comes from database queries, and the output is stored in a CSV or YAML format. Typical data formats I want to explore a bit more are:
- Linear Algebra types, for example a vector, matrix, sparse matrix or tensor. A sparse matrix is one where a large portion of the entries are zero, so it is commonly stored in a compact format that keeps only the nonzero entries.
- Database, I want to better understand how to transform the typical tuple of tuples structures you get from a database query into something more structured.
- Text files, such as YAML or CSV-files.
- Feeds, perhaps RSS or Atom feeds.
- Web formats, perhaps HTML or JSON.
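As a starting point for the database item above, here is a minimal sketch that turns the plain tuples from a query into named records and exports them as CSV. It uses only the standard library; the table name, column names and values are made up for illustration. Note that sqlite3 returns a list of tuples, while some other database drivers return a tuple of tuples; the approach is the same.

```python
import csv
import io
import sqlite3
from collections import namedtuple

# In-memory database with a hypothetical table, made up for illustration.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE measurement (sensor TEXT, value REAL)")
connection.executemany(
    "INSERT INTO measurement VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("a", 3.0)],
)

cursor = connection.execute(
    "SELECT sensor, value FROM measurement ORDER BY rowid"
)
rows = cursor.fetchall()  # a list of plain tuples

# Build a record type from the column names reported by the cursor.
columns = [description[0] for description in cursor.description]
Measurement = namedtuple("Measurement", columns)
records = [Measurement(*row) for row in rows]

# Export the structured records as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(columns)
writer.writerows(records)
csv_text = buffer.getvalue()

connection.close()
print(csv_text)
```

With named records, `records[0].sensor` reads better than `rows[0][0]`, and the column names travel with the data into the CSV header.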
Data Analysis Process, Topics, Python Data Analysis Tools and Data Formats
- Data Analysis Process
  - collecting raw data
  - data processing
  - data cleaning
  - exploratory data analysis
  - data models and algorithms
  - data communication, visualization and reporting
  - making decisions
  - data product
- Data Analysis Topics
  - business intelligence
  - data mining
  - text mining
  - descriptive statistics
  - exploratory data analysis
  - confirmatory data analysis
  - predictive analytics
- Python Data Analysis Tools
  - SciPy
  - NumPy
  - Matplotlib
  - pandas
  - SymPy
  - networkx
  - IPython
  - IDLE
- Data Formats
  - Linear Algebra types
  - Database
  - Text files
  - Feeds
  - Web formats
This page has a lot of links. A convention I use is W for Wikipedia, H for what I guess is the official homepage.
The quotes are all from Wikipedia -- The Free Encyclopedia. Most come from the large article on Data Analysis, which has more than 300 links to other Wikipedia articles.