Features/Tasks

What is wrong?

We are trying to identify problems or gaps in SOFA Statistics which make it unsuitable for some users. If you don't mind our asking, what were you wanting from SOFA that was lacking? Or was something else a better fit? Please don't spare our feelings - we really need to get this right.

Python code of relevance

http://aliquote.org/memos/2011/02/07/python-for-statistical-computing

Does SOFA do X?

Sorry - not at this stage, but there could well be something like that in the future. There are other open source options you may wish to consider in the meantime. See http://www.openstatistics.com and http://en.wikipedia.org/wiki/List_of_statistical_packages#Open_source.

The longer-term plan is for SOFA Statistics to support plug-ins for a multitude of extra tests. SOFA will supply the data, the variable and value labels, the data type information, and the reports in which to display the output. The plug-in will only have to comply with a simple interface for both input and output. If the plug-in approach attracts developers and statisticians, SOFA should be able to support what you are looking for. The emphasis will remain, however, on making it easy for people to use the most common tests without being overwhelmed.
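The plug-in API described above does not exist yet; purely as an illustration of the "simple interface for both input and output" idea, a hypothetical plug-in contract might look something like this (all names here are invented, not SOFA's actual API):

```python
# Hypothetical sketch only - SOFA's plug-in API is not yet designed.
# SOFA would supply the data plus variable/value labels and data type
# information; the plug-in returns an HTML fragment for the report.

class StatsTestPlugin:
    """Interface every stats-test plug-in would implement."""
    name = "base"

    def run(self, data, var_labels, val_labels, var_types):
        """Return an HTML fragment to display in the report."""
        raise NotImplementedError

class MeanPlugin(StatsTestPlugin):
    """Trivial example plug-in: report the mean of one numeric variable."""
    name = "mean"

    def run(self, data, var_labels, val_labels, var_types):
        var, values = next(iter(data.items()))
        mean = sum(values) / float(len(values))
        label = var_labels.get(var, var)
        return "<p>Mean of %s: %.2f</p>" % (label, mean)

# Usage - SOFA would do the calling, passing in its own data structures:
plugin = MeanPlugin()
html = plugin.run({"age": [20, 30, 40]}, {"age": "Age in years"}, {},
                  {"age": "numeric"})
```

The point of the sketch is that the plug-in never touches the data store directly: everything it needs arrives through the call, and everything it produces goes back as report output.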

Misc

“I think that the biggest mistake people make is latching onto the first idea that comes to them and trying to do that. It really comes to a thing that my folks taught me about money. Don’t buy something unless you’ve wanted it three times. Similarly, don’t throw in a feature when you first think of it. Think if there’s a way to generalize it, think if it should be generalized. Sometimes you can generalize things too much. I think like the things in Scheme were generalized too much. There is a level of abstraction beyond which people don’t want to go. Take a good look at what you want to do, and try to come up with the long-term lazy way, not the short-term lazy way.” — Larry Wall - Creator of the Perl programming language

No longer relevant

Table to Long Form

Hi Nick,

I thought I would send a bit of encouragement on the concept. I can see the point of providing a consistent interface that works irrespective of whether the datastore is a database or a spreadsheet, or whatever. I wouldn't use it for Django, but it would have a home in data analysis projects.

We sometimes have to deal with the tedious problem of sucking data out of spreadsheets and into either R or python. As an example of an only tangentially related project that aims to go in the other direction (in this case from spreadsheets into R), Jimmy Oh has been working on the prosaically named TableToLongForm:

https://www.stat.auckland.ac.nz/~joh024/Research/TableToLongForm/TableToLongForm_A1Poster.pdf
https://github.com/joh024/TableToLongForm

Ed [(Dragonfly)]

Auto-updated Office Output

From cesar.rabak@gmail.com

A fact of life is that analysing data tends to reveal information which, when seen in aggregated/summarised/analytical form, prompts questions from the supplier of the data, and those questions may lead to corrections of the data as initially entered.

In fact, most methodologies have a “data validation” or “preliminary review” stage where an opportunity to address this is offered to the client¹

The most advanced solution to this has been developed by the R community under the name “reproducible research”: you create a document whose statistical-analysis parts are linked in such a fashion that an automatic procedure can regenerate the report, so if the data changes, the changes propagate through the report, updating and correcting it.

Obviously this reduces the chance of introducing errors, for example through omission of a corrected table or graph, and it also eases versioning of the document (the deliverable).

For an overview from the R point of view, please see http://cran.r-project.org/web/views/ReproducibleResearch.html; see also the “related links” there for more ideas on this.

In a nutshell, most of these solutions let you embed special chunks of information between a specific set of marks. The word-processing tool initially treats these as a form of comment or macro; a post-processing pass then replaces each chunk with the results of the analysis coded in it, producing the deliverable.

Now my opinion, wearing my dev hat: since SOFA is Python based and there are Python bindings for Microsoft Office and Open/LibreOffice applications, I do believe a similar solution could be developed to allow SOFA to produce similar results.
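The chunk-replacement idea can be sketched in a few lines of Python. The `<<sofa:...>>` marker syntax and the analysis registry here are invented for illustration only:

```python
import re

# Sketch of the "special chunks between marks" idea: a post-processor
# finds marked chunks in a document and replaces each with the freshly
# computed result of the analysis named inside it, so re-running the
# processor after a data correction regenerates the whole report.
ANALYSES = {
    "n": lambda data: str(len(data["age"])),
    "mean_age": lambda data: "%.1f" % (sum(data["age"]) / float(len(data["age"]))),
}

def regenerate(doc, data):
    """Replace every <<sofa:NAME>> chunk with its computed result."""
    def fill(match):
        return ANALYSES[match.group(1)](data)
    return re.sub(r"<<sofa:(\w+)>>", fill, doc)

doc = "We surveyed <<sofa:n>> people; mean age was <<sofa:mean_age>>."
print(regenerate(doc, {"age": [20, 30, 40, 50]}))
```

Running the processor again with corrected data simply re-fills the same chunks, which is the whole point of the reproducible-research approach.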

Random Samples

Ability to run analyses against a random sample of a large dataset e.g. 5% or some arbitrary percentage. Make it a filter option.

The logical place to access it would probably be the filter dialog. Implementing it would take a bit of thought because of the massive performance differences likely between different strategies. Making some sort of temporary table seems a potentially good strategy, possibly into the database we are working with e.g. MySQL#. It could also be possible to combine filters e.g. I might want to look at gender = 1 for a 5% sample. It would make sense to apply the sampling filter first (resulting in a one off temporary table-making process) and then apply any filters onto that for specific analyses just as we would off a normal, unsampled table.

# Making a temporary table into SQLite could also work (see http://www.sqlite.org/tempfiles.html) but SOFA would need to reset the dbe, con, cur etc information whenever that table was selected. Could be trivial but it would pay to be careful.

http://www.carlj.ca/2007/12/16/selecting-random-records-with-sql/

My instinct is towards letting the SQL database engine do the heavy lifting for which it is optimised.

SELECT Field1, ..., Field2
FROM Table1
ORDER BY Random()  -- Random() in SQLite; MySQL uses RAND()
LIMIT 1            -- raise the limit to take a larger sample
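The one-off temporary-table strategy described above can be sketched with the stdlib sqlite3 module (table and column names here are invented for the example):

```python
import sqlite3

# Sketch of the sampling-filter idea: materialise a random ~5% sample
# once into a temporary table, then run ordinary filtered analyses
# against it just as we would against a normal, unsampled table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE survey (id INTEGER, gender INTEGER)")
cur.executemany("INSERT INTO survey VALUES (?, ?)",
                [(i, i % 2) for i in range(1000)])

# One-off temporary table holding roughly 5% of rows.
cur.execute("""CREATE TEMPORARY TABLE survey_sample AS
               SELECT * FROM survey
               WHERE (abs(random()) % 100) < 5""")

# Subsequent filters (e.g. gender = 1) then apply to the sample.
cur.execute("SELECT COUNT(*) FROM survey_sample WHERE gender = 1")
print(cur.fetchone()[0])
```

Note that per the footnote below, the dbe/con/cur bookkeeping would need resetting whenever the temporary table was selected.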

Trendlines on Line Charts

It's called a “power” trendline and the equation is: y = 1E-10 × x^23.83

The R² value (whatever that is) is 0.866.

You can choose Exponential, Linear, Logarithmic, Polynomial or Moving Average trendlines but I chose “power” purely because it gave the most pleasing curve…

Easier to Share Value Labels for Different Variables

Perhaps add a button which says “use value labels from a selected variable”. It would show the user the labels so they can apply them, either modifying them first and saving, or just saving as-is.
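Behind the button, this is just copying one variable's label mapping onto another in the project vdt. A minimal sketch, assuming value labels are stored as a dict keyed by variable name (the structure is an assumption based on the vdt examples elsewhere on this page):

```python
import copy

# Sketch of "use value labels from a selected variable". The vdt
# structure (dict of variable name -> {value: label}) is assumed.
val_labels = {
    u"gender": {1: u"Male", 2: u"Female"},
}

def share_labels(val_labels, src_var, dest_var):
    """Apply src_var's value labels to dest_var. Deep-copied so that
    later edits to one variable's labels don't silently change the
    other's."""
    val_labels[dest_var] = copy.deepcopy(val_labels[src_var])

share_labels(val_labels, u"gender", u"gender_partner")
```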

Multiple Response Sets

Approach 1

Here is a scenario (please feel free to make your own - I'm a bit rusty on this):

Favourite cars - can choose multiple. Each question is stored as a separate variable.

Three voters: Marco, Frida & Henry

OK so what we want is to be able to provide a short label e.g. “Car”, a long label e.g. “Favourite Sports Car”, a unique name for the multiple response configuration e.g. “favcar1”, and a table with N and % for each car, total N, and a comment.

Favourite Sports Car
Car             N      %
Ferrari         2   25.0
Porsche         1   12.5
Lamborghini     1   12.5
McLaren         3   37.5
Aston Martin    1   12.5
                8   +
+ Percentages will add up to greater than 100%
when there are multiple responses

The configuration would be stored in the appropriate project vdt (variable details) file. Multiple response configurations might look like this behind the scenes (in a plain text file written in Python):

multresp = {u"favcar1": {u"shortname": u"Car",
                         u"longname": u"Favourite Sports Car", 
                         u"vars": [u"q1a", u"q1b", u"q1c", u"q1d", u"q1e", u"q1f"]},
            u"favpizza": {... etc},
}
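Given such a configuration, the N/% tallying could be sketched as follows. The flat row layout and helper name are assumptions; percentages are computed out of total responses to match the table above:

```python
from collections import Counter

# Sketch of tallying a multiple response set: count every non-empty
# answer across the configured variables. Data layout is an assumption.
def mult_resp_freqs(rows, config):
    counts = Counter()
    for row in rows:
        for var in config[u"vars"]:
            val = row.get(var)
            if val:
                counts[val] += 1
    total = sum(counts.values())
    pcts = dict((val, 100.0 * n / total) for val, n in counts.items())
    return counts, total, pcts

config = {u"shortname": u"Car", u"longname": u"Favourite Sports Car",
          u"vars": [u"q1a", u"q1b", u"q1c"]}
rows = [  # Marco, Frida & Henry
    {u"q1a": u"Ferrari", u"q1b": u"McLaren"},
    {u"q1a": u"Porsche", u"q1b": u"McLaren", u"q1c": u"Aston Martin"},
    {u"q1a": u"Ferrari", u"q1b": u"Lamborghini", u"q1c": u"McLaren"},
]
counts, total, pcts = mult_resp_freqs(rows, config)
print(total, pcts[u"Ferrari"])  # 8 responses; Ferrari on 25.0% of them
```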

It would also be nice to be able to do bar charts of multiple response sets - of either N or %.

e.g. (the real one would be much better looking of course!):

Favourite Sports Car
                                       x
   x                                   x
   x          x            x           x           x 
Ferrari    Porsche    Lamborghini   McLaren   Aston Martin

I think the easiest way to give access to multiresp sets would be in the drop-down lists of variables. Now variables are table-specific - perhaps multiresp sets should be as well. There would be no point having the option of selecting “favcar1 (multiple responses)” if the current table didn't have the relevant variables in it.

Hmmmmm - I wonder if I should leave variable details project-wide or make them table-specific as well. There are pros and cons both ways. There may also be ways of indicating whether a variable detail is project-wide or table-specific (default).

Approach 2

It's more common to face a multiple response set where you don't know in advance what the answers are going to be, and where, besides that, there is a co-dependency with a previous question. This is what I mean:

Keeping up with the cars example:

The first scenario is 'CARS OWNED' with a preliminary list of 65 brands, so you cannot preset them as individual questions. Assuming a maximum of 5 mentions per informant, you'll still have five variables (q01a thru q01e):

But now we can't say that q01a will always have 'FERRARI' as the response; actually, q01a could have any of the 65 preliminary brands, or even new brands, so it's necessary to be able to add a 'vals' mapping as well…

(I don't know any Python, but I think this would be the idea…)

multresp = {u"favcar1": {u"shortname": u"Car",
                         u"longname": u"Cars owned",
                         u"vars": [u"q01a", u"q01b", u"q01c", u"q01d", u"q01e"],
                         u"vals": {1: u"Ferrari", 2: u"Porsche", 3: u"Lamborghini"}},
            u"favpizza": {... etc},
           }

And now, in this case, I can handle all the mentions that I need.

For this scenario, using the same info you provided, I will get exactly the same results:

Three voters: Marco, Frida & Henry

(For all cases, q01d and q01e are empty)

CAR OWNED
Car             N      %
Ferrari         2   25.0
Porsche         1   12.5
Lamborghini     1   12.5
McLaren         3   37.5
Aston Martin    1   12.5
                8   *
* Percentages will add up to greater than 100%
when there are multiple responses

There is also an even more common scenario:

'PREFERRED CAR' - single selection - q02

'REASONS OF PREFERENCE' - multiple selection - q03a thru q03e

Once again, in q03a thru q03e I'm going to have a lot of different answers, and in this case I can't even think about a preset list… so it's a completely open question.

So, you will have

REASONS OF PREFERENCE
Car                N      %
Ferrari           
    Beauty         2      100
    Fashion        1       50
    Speed          2      100
    Style          1       50
    Total          2        *
McLaren
    Performance    1      100
    Speed          1      100
    Tradition      1      100
    Total          1        *
   
* Percentages will add up to greater than 100%
when there are multiple responses

Basically, both scenarios depend on being able to add 'Value Labels' to each question, so there's no dependency between position and value.

With this last scenario the chart will also change from the common one: instead of making a single 'TOTAL REASONS' chart, it should graph the results nested the same way they are nested in the tables, giving two charts for the same question.

Now, talking about the way to give access, I think a drop-down list would be OK, but with multiple selection enabled as well, so if the multresp set you are going to create has 15 variables you can click all 15 without having to open the dialog again. Also, from personal experience, the chance to remove one or more vars from the list at the same time, without reopening the dialog, would help - when adding multiple vars to a list it often happens that you end up picking more than needed :-)

I really think 'variable details' should be table-specific because, even though I always have a variable called AGE(group) in all my projects, the groups per se are not the same in any of them. Even 'GENDER': sometimes it's 1: u'MALE', 2: u'FEMALE' and sometimes it's backwards. So I think there should at least be an option where you can decide between table-specific and global. (For me, global is not an option ;P )

Copy and Paste Into and From SOFA Data Tables

Original user suggestion: Copying and pasting from a spreadsheet into SOFA. For example, it is very convenient to copy data from a spreadsheet you've created in software like Excel because those are very nice for data arrangement and input. Simple as that: copy a column and paste it into SOFA. In this way it is possible to create a table in a few clicks by combining the power of data listing in, for example, Excel with the statistical calculations in SOFA.

Building blocks for implementing copy and pasting:

The following looked interesting:

http://ginstrom.com/scribbles/2008/09/07/getting-the-selected-cells-f...

http://wiki.wxpython.org/wxGrid#Selection_Set_Management

From: http://www.blog.pythonlibrary.org/2010/03/18/wxpython-an-introduction...

And this project: https://sourceforge.net/projects/pywxgridedit/. From the site, it “Enhances wx.grid to provide functionality for cut/ copy/paste and undo/redo. Implemented as mixin.”

Here is the most recent code (about 6 months old): http://pywxgridedit.svn.sourceforge.net/viewvc/pywxgridedit/trunk/pyw...
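Whatever the wxGrid side ends up looking like, the data side is straightforward: spreadsheet selections typically arrive on the clipboard as plain text with tab-delimited cells and newline-delimited rows. A parsing sketch (no wx required; the function name is invented):

```python
def clipboard_to_rows(text):
    """Turn spreadsheet-style clipboard text (tab-delimited cells,
    newline-delimited rows) into a list of row lists, ready to write
    into a SOFA data table."""
    # Spreadsheets usually terminate the selection with a trailing newline.
    lines = text.rstrip("\r\n").split("\n")
    return [line.rstrip("\r").split("\t") for line in lines]

rows = clipboard_to_rows("name\tage\nMarco\t33\nFrida\t29\n")
```

The grid code would then only need to place each parsed cell relative to the current selection.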

Histogram-type Frequency Groupings

When running a frequency table, have option of running frequencies on auto-generated bins.
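A sketch of what auto-generated bins might look like in pure Python. The equal-width rule used here is an assumption, not necessarily the rule SOFA would choose:

```python
def bin_freqs(values, n_bins):
    """Frequencies over n_bins equal-width bins spanning the data range.
    The top edge is included in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(n_bins)
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    labels = ["%g to < %g" % (lo + i * width, lo + (i + 1) * width)
              for i in range(n_bins)]
    return list(zip(labels, counts))

print(bin_freqs([1, 2, 2, 3, 5, 6, 8, 9, 10], 3))
```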

Quicken importer(/exporter)

A quicken/QIF importer could be a nice feature for the future :-)

http://allmybrain.com/2009/02/04/converting-financial-csv-data-to-ofx-or-qif-import-files/

Export to PDF

The best option is probably the one which uses WebKit to render and then turns the (WebKit-guaranteed perfect) result into PDF:

wkhtmltopdf

User suggestion (Marek Paweł Karczewski marek_karczewski@yahoo.com.au): Maybe you could add options “Export charts to separate files” and “Export tables to separate files”, saving each table or chart to a separate file, preferably in PDF format. For instance, I have written a thesis in LaTeX, using LyX, and needed to show a couple of charts. LyX likes the input format of graphics to be PDF.

Also be aware of alternatives:

Gael Varoquaux (core developer for Mayavi2) has developed a very helpful tool called “pyreport”.

pyreport is a program that runs a python script and captures its output, 
compiling it to a pretty report in a pdf or an html file. It can display the 
output embedded in the code that produced it and can process special comments 
(literate comments) according to markup languages (rst or LaTeX) to compile a 
very readable document.

It can also embed pretty plots from captured matplotlib output within the compiled document. Really, quite neat:

pyreport

dompdf

Export to SPSS/SAS

Oracle Connection

Hi Kris,

OK - I'll start with 10g.

Re: testing I don't mean anything fancy - just a simple python script something like this:

import cx_Oracle

...
...connect()
def get_tbls(...):
    ....
    print("Tables: %s" % tbls)
get_tbls()
def get_dbs(...):
    ....
    print("Databases: %s" % dbs)
get_dbs()

At this stage the task is simply to see if I can flesh out the SOFA Statistics database plug-in API.

All the best, Grant
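The same shape of test script can be exercised today against SQLite (stdlib), since the dbe plug-in API needs the same primitives (list tables, list fields) whatever the engine; a cx_Oracle version would swap in queries against Oracle's data dictionary (e.g. user_tables). Helper names here are illustrative only:

```python
import sqlite3

# Stand-in for the Oracle test script: exercise the dbe plug-in API
# shape (list tables, list fields) against stdlib SQLite first.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (id INTEGER, name TEXT)")

def get_tbls(con):
    cur = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")
    return [row[0] for row in cur]

def get_flds(con, tbl):
    cur = con.execute("PRAGMA table_info(%s)" % tbl)
    return [row[1] for row in cur]  # column 1 of table_info is the name

print("Tables: %s" % get_tbls(con))
print("Fields: %s" % get_flds(con, "demo"))
```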

On 27/01/10 04:49, Kris Bravo wrote:

If there are any issues to avoid you could start with 10g express, 10g
cx_Oracle then work out any kinks with switching python libraries.
Other than that, it looks like a straight forward process.

Regarding tests, I haven't looked at your existing test cases much, but
if I find any issues connecting to live data I'd like to give you some
sort of script or test case that you can repeat. Do you have some
thoughts on this? A UI runner, or integration test stack?

Kris

On Jan 24, 8:41 pm, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:

Hi Kris,


Thanks for the feedback. Would the following be a good strategy:

1) I install 10g express edition for testing
2) I install the 11g version of cx_Oracle
3) I start building a module called test_oracle.py which starts fleshing
out the parts behind the SOFA Statistics database plug-in API e.g.
listing available databases, tables, fields etc. You help me test them
(very simple process) as we go.
4) I build a working version of dbe_oracle.py which is automatically
included in the default SOFA Statistics installation packages
5) Testing occurs and then the module is included in the next available
release

Feel free to suggest any changes.

All the best, Grant

Kris Bravo wrote:


Hi Grant,


It is pretty common to connect to databases spanning the different
versions. I checked my local installation and it is the 10g XE install
- You should be pretty safe developing against that. The java drivers
tend to support lower versions so hopefully the python library you
pointed out has the same trait.


Regarding actual installations, the user would only have a single
version installed. They tend to occupy a lot of ports and service
names.


More later when I get a free moment.


Kris Bravo

On Jan 24, 3:29 am, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:



Hi Kris,


Excellent - I will start doing some experimenting. Please forgive me if
I ask a few “dumb” questions along the way about Oracle databases,
typical Oracle database scenarios etc.


Q1: would many users run more than one version of Oracle e.g. 10g and
11g? Is it even possible to do that?
I am hoping not. One option is to let users decide which version of
cx_Oracle to install during the SOFA Statistics installation. I get the
feeling that the different versions of cx_Oracle are not meant to
co-exist but I would love to be proven wrong.


Here is something which suggests you get cx_Oracle for one version only:


Just go to http://www.computronix.com/utilities.shtml and download a
Windows installer or a Linux RPM file matching your versions of Python
and Oracle.
(http://www.ioug.org/python.pdf)


Ideally, the 11g version would handle all versions below it but I
haven't found any documentation to that effect. If it does, I will only
ever have to install one and not bother the user with any decisions. If
not, users will only be able to use the version they have selected at
installation. Unless there is some way of installing multiple cx_Oracle
modules alongside each other and calling them separately. The standard
approach is to:
import cx_Oracle
which doesn't indicate the ability to select version. I will need to do
some experiments on this. And to do that I will need at least one
Oracle database to test on. Which leads to …


Q2: I have found a link to the 10g express edition. But no luck with an
11g version. Is that likely to exist? Would it surprise you as an
Oracle user if there were no 11g express version?
http://www.oracle.com/technology/products/database/xe/index.html


I am probably only going to be able to test against the 10g version
myself and rely on others for feedback on the 9i and 11g versions.


Any feedback/advice etc from Python users with Oracle experience is welcome.


All the best, Grant

Kris Bravo wrote:



Hi Grant,


Our environment is Oracle 11g.


I fetched a copy of the source to catch up on how you have the modules
organised and to help test. We're using a variety of Oracle
deployments here ranging from local XE installs to RACs with failover.
I ought to be able to provide some good feedback.


Kris

On Jan 23, 9:47 pm, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:

Hi Kris,


You're right - I haven't made an Oracle module yet (all available
modules are in the dbe_plugins subfolder). Can I conclude from your
question that you use Oracle yourself? If so, which version(s)? And
would you be able to answer some basic Oracle questions for me
(orientation rather than detailed technical questions). And possibly
help me do some basic testing of a new module? Or would anyone else on
this group be able to or know anyone who could? As a first step I would
try to get some standalone functions working which list tables, list
fields etc. Then I would put it all together in a dbe_oracle.py module
ready to just plug into SOFA Statistics (or possibly, separate
dbe_oracle_9i.py, dbe_oracle_10g.py, and dbe_oracle_11g.py modules if
necessary). Anyway, here are a couple of useful links for future
reference:
http://cx-oracle.sourceforge.net/ - has separate installers by Oracle version.
http://cx-oracle.sourceforge.net/html/index.html


My immediate priority at the moment is to get a Mac installer package
built for SOFA. But I am happy to start the ball rolling on an Oracle
database plug-in if I can get a little bit of advice e.g. which versions
to cover (9i, 10g, 11g).


All the best, Grant

Kris Bravo wrote:



I looked for some indication that Sofa supports connecting to an
Oracle database but couldn't find anything. Is this a feature that's
available/in the works/scoped for the future?


Kris

Table hover

Used jQuery.

Cross-platform Excel

Workbook

  get_sheet_names
  getsheet

Worksheet

  get_data_rows_n
  get_fld_names
  iter

Ubuntu has python-xlrd so make it a dependency in the Deb. Add Windows installer for xlrd. What about Mac?

xlrd Author: John Machin Licence: BSD

  import xlrd
  book = xlrd.open_workbook("myfile.xls")
  print "The number of worksheets is", book.nsheets
  print "Worksheet name(s):", book.sheet_names()
  sh = book.sheet_by_index(0)
  print sh.name, sh.nrows, sh.ncols
  print "Cell D30 is", sh.cell_value(rowx=29, colx=3)
  for rx in range(sh.nrows):
      print sh.row(rx)
  # Refer to docs for more details.
  # Feedback on API is welcomed.
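The Workbook/Worksheet wrapper sketched above could be fleshed out as follows. The in-memory rows here stand in for an xlrd-backed implementation (which would delegate to `open_workbook`, `sheet_names()` etc.); the method names are taken from the notes above:

```python
# Sketch of the cross-platform wrapper. The in-memory backing is for
# illustration only; the real thing would delegate to xlrd.
class Worksheet:
    def __init__(self, name, rows):
        self.name = name
        self._rows = rows  # first row assumed to hold the field names

    def get_fld_names(self):
        return self._rows[0]

    def get_data_rows_n(self):
        return len(self._rows) - 1  # excluding the header row

    def __iter__(self):  # "iter" in the notes above
        return iter(self._rows[1:])

class Workbook:
    def __init__(self, sheets):
        self._sheets = dict((sheet.name, sheet) for sheet in sheets)

    def get_sheet_names(self):
        return list(self._sheets)

    def getsheet(self, name):
        return self._sheets[name]

book = Workbook([Worksheet("Sheet1", [["id", "age"], [1, 33], [2, 29]])])
sheet = book.getsheet("Sheet1")
```

Keeping the interface this small is what would let the same import code run off xlrd on Windows/Mac and python-xlrd on Ubuntu.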

Old Version Plans

Version 1.1

  1. Trendlines option for single line charts
  2. Ability to export individual output and entire reports as PDF (see Export to PDF), and possibly individual chart output as images.
  3. Improved support for copying and pasting output.
  4. Easier to share value labels between different variables.
  5. Support for multiple response tables
  6. Able to explicitly select language preference

Version 1.2

  1. Creating a plug-in system for adding extra stats tests as modules
  2. Random samples
  3. User-defined missing values
  4. Windows installer knowing if packages are installed

Original Design Goals for SOFA Project