Features/Tasks

What is wrong?

We are trying to identify problems or gaps in SOFA Statistics which make it unsuitable for some users. If you don't mind our asking, what were you wanting from SOFA that was lacking? Or was something else a better fit? Please don't spare our feelings - we really need to get this right.

Python code of relevance

http://aliquote.org/memos/2011/02/07/python-for-statistical-computing

Does SOFA do X?

Sorry - not at this stage, but there could well be something like that in the future. There are other open source options you may wish to consider in the meantime. See http://www.openstatistics.com and http://en.wikipedia.org/wiki/List_of_statistical_packages#Open_source.

The longer-term plan is for SOFA Statistics to support plug-ins for a multitude of extra tests. SOFA will supply the data, the variable and value labels, the data type information, and the reports in which to display the output. The plug-in will only have to comply with a simple interface for both input and output. If the plug-in approach attracts developers and statisticians, SOFA should be able to support what you are looking for. The emphasis will remain, however, on making it easy for people to use the most common tests without being overwhelmed.
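The plug-in API described above does not exist yet; purely as an illustration of the "simple interface for both input and output" idea, a hypothetical plug-in contract might look something like this (all names here are invented, not SOFA's actual API):

```python
# Hypothetical sketch only - SOFA's plug-in API is not yet designed.
# SOFA would supply the data plus variable/value labels and data type
# information; the plug-in returns an HTML fragment for the report.

class StatsTestPlugin:
    """Interface every stats-test plug-in would implement."""
    name = "base"

    def run(self, data, var_labels, val_labels, var_types):
        """Return an HTML fragment to display in the report."""
        raise NotImplementedError

class MeanPlugin(StatsTestPlugin):
    """Trivial example plug-in: report the mean of one numeric variable."""
    name = "mean"

    def run(self, data, var_labels, val_labels, var_types):
        var, values = next(iter(data.items()))
        mean = sum(values) / float(len(values))
        label = var_labels.get(var, var)
        return "<p>Mean of %s: %.2f</p>" % (label, mean)

# Usage - SOFA would do the calling, passing in its own data structures:
plugin = MeanPlugin()
html = plugin.run({"age": [20, 30, 40]}, {"age": "Age in years"}, {},
                  {"age": "numeric"})
```

The point of the sketch is that the plug-in never touches the data store directly: everything it needs arrives through the call, and everything it produces goes back as report output.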

Misc

“I think that the biggest mistake people make is latching onto the first idea that comes to them and trying to do that. It really comes to a thing that my folks taught me about money. Don’t buy something unless you’ve wanted it three times. Similarly, don’t throw in a feature when you first think of it. Think if there’s a way to generalize it, think if it should be generalized. Sometimes you can generalize things too much. I think like the things in Scheme were generalized too much. There is a level of abstraction beyond which people don’t want to go. Take a good look at what you want to do, and try to come up with the long-term lazy way, not the short-term lazy way.” — Larry Wall - Creator of the Perl programming language

No longer relevant

Table to Long Form

Hi Nick,

I thought I would send a bit of encouragement on the concept. I can see the point of providing a consistent interface that works irrespective of whether the datastore is a database or a spreadsheet, or whatever. I wouldn't use it for Django, but it would have a home in data analysis projects.

We sometimes have to deal with the tedious problem of sucking data out of spreadsheets and into either R or python. As an example of an only tangentially related project that aims to go in the other direction (in this case from spreadsheets into R), Jimmy Oh has been working on the prosaically named TableToLongForm:

https://www.stat.auckland.ac.nz/~joh024/Research/TableToLongForm/TableToLongForm_A1Poster.pdf
https://github.com/joh024/TableToLongForm

Ed [(Dragonfly)]

Auto-updated Office Output

From cesar.rabak@gmail.com

A fact of life is that analysing data tends to reveal information which, when seen in aggregated/summarised/analytical form, prompts questions from the supplier of the data, and those questions may lead to corrections of the data as initially entered.

In fact, most methodologies have a “data validation” or “preliminary review” stage where an opportunity to address this is offered to the client¹

The most advanced solution to this has been developed by the R community under the name “reproducible research”: you create a document whose statistical-analysis parts are linked in such a fashion that an automatic procedure can regenerate the report, so if the data changes, the changes propagate through the report, updating and correcting it.

Obviously this reduces the chance of introducing errors, for example through omission of a corrected table or graph, and it also eases versioning of the document (the deliverable).

For an overview from the R point of view, please see http://cran.r-project.org/web/views/ReproducibleResearch.html; see also the “related links” there for more ideas on this.

In a nutshell, most of these solutions let you embed special chunks of information between a specific set of marks. The word-processing tool initially treats these as a form of comment or macro; a post-processing pass then replaces each chunk with the results of the analysis coded in it, producing the deliverable.

Now my opinion, wearing my dev hat: since SOFA is Python based and there are Python bindings for Microsoft Office and Open/LibreOffice applications, I do believe a similar solution could be developed to allow SOFA to produce similar results.
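The chunk-replacement idea can be sketched in a few lines of Python. The `<<sofa:...>>` marker syntax and the analysis registry here are invented for illustration only:

```python
import re

# Sketch of the "special chunks between marks" idea: a post-processor
# finds marked chunks in a document and replaces each with the freshly
# computed result of the analysis named inside it, so re-running the
# processor after a data correction regenerates the whole report.
ANALYSES = {
    "n": lambda data: str(len(data["age"])),
    "mean_age": lambda data: "%.1f" % (sum(data["age"]) / float(len(data["age"]))),
}

def regenerate(doc, data):
    """Replace every <<sofa:NAME>> chunk with its computed result."""
    def fill(match):
        return ANALYSES[match.group(1)](data)
    return re.sub(r"<<sofa:(\w+)>>", fill, doc)

doc = "We surveyed <<sofa:n>> people; mean age was <<sofa:mean_age>>."
print(regenerate(doc, {"age": [20, 30, 40, 50]}))
```

Running the processor again with corrected data simply re-fills the same chunks, which is the whole point of the reproducible-research approach.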

Random Samples

Ability to run analyses against a random sample of a large dataset e.g. 5% or some arbitrary percentage. Make it a filter option.

The logical place to access it would probably be the filter dialog. Implementing it would take a bit of thought because of the massive performance differences likely between different strategies. Making some sort of temporary table seems a potentially good strategy, possibly into the database we are working with e.g. MySQL#. It could also be possible to combine filters e.g. I might want to look at gender = 1 for a 5% sample. It would make sense to apply the sampling filter first (resulting in a one off temporary table-making process) and then apply any filters onto that for specific analyses just as we would off a normal, unsampled table.

# Making a temporary table into SQLite could also work (see http://www.sqlite.org/tempfiles.html) but SOFA would need to reset the dbe, con, cur etc information whenever that table was selected. Could be trivial but it would pay to be careful.

http://www.carlj.ca/2007/12/16/selecting-random-records-with-sql/

My instinct is towards letting the SQL database engine do the heavy lifting for which it is optimised.

SELECT Field1, ..., Field2
FROM Table1
ORDER BY Random()  -- Random() in SQLite; MySQL uses RAND()
LIMIT 1            -- raise the limit to take a larger sample
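The one-off temporary-table strategy described above can be sketched with the stdlib sqlite3 module (table and column names here are invented for the example):

```python
import sqlite3

# Sketch of the sampling-filter idea: materialise a random ~5% sample
# once into a temporary table, then run ordinary filtered analyses
# against it just as we would against a normal, unsampled table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE survey (id INTEGER, gender INTEGER)")
cur.executemany("INSERT INTO survey VALUES (?, ?)",
                [(i, i % 2) for i in range(1000)])

# One-off temporary table holding roughly 5% of rows.
cur.execute("""CREATE TEMPORARY TABLE survey_sample AS
               SELECT * FROM survey
               WHERE (abs(random()) % 100) < 5""")

# Subsequent filters (e.g. gender = 1) then apply to the sample.
cur.execute("SELECT COUNT(*) FROM survey_sample WHERE gender = 1")
print(cur.fetchone()[0])
```

Note that per the footnote below, the dbe/con/cur bookkeeping would need resetting whenever the temporary table was selected.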

Trendlines on Line Charts

It's called a “power” trendline and the equation is: y = 1E-10 × x^23.83

The R² value (whatever that is) is 0.866.

You can choose Exponential, Linear, Logarithmic, Polynomial or Moving Average trendlines but I chose “power” purely because it gave the most pleasing curve…

Easier to Share Value Labels for Different Variables

Perhaps add a button which says “use value labels from a selected variable”. It would show the user the labels so they can apply them, either modifying them first and saving, or just saving as-is.
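Behind the button, this is just copying one variable's label mapping onto another in the project vdt. A minimal sketch, assuming value labels are stored as a dict keyed by variable name (the structure is an assumption based on the vdt examples elsewhere on this page):

```python
import copy

# Sketch of "use value labels from a selected variable". The vdt
# structure (dict of variable name -> {value: label}) is assumed.
val_labels = {
    u"gender": {1: u"Male", 2: u"Female"},
}

def share_labels(val_labels, src_var, dest_var):
    """Apply src_var's value labels to dest_var. Deep-copied so that
    later edits to one variable's labels don't silently change the
    other's."""
    val_labels[dest_var] = copy.deepcopy(val_labels[src_var])

share_labels(val_labels, u"gender", u"gender_partner")
```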

Multiple Response Sets

Approach 1

Here is a scenario (please feel free to make your own - I'm a bit rusty on this):

Favourite cars - can choose multiple. Each question is stored as a separate variable.

Three voters: Marco, Frida & Henry

OK so what we want is to be able to provide a short label e.g. “Car”, a long label e.g. “Favourite Sports Car”, a unique name for the multiple response configuration e.g. “favcar1”, and a table with N and % for each car, total N, and a comment.

Favourite Sports Car
Car             N      %
Ferrari         2   25.0
Porsche         1   12.5
Lamborghini     1   12.5
McLaren         3   37.5
Aston Martin    1   12.5
                8   +
+ Percentages will add up to greater than 100%
when there are multiple responses

The configuration would be stored in the appropriate project vdt (variable details) file. Multiple response configurations might look like this behind the scenes (in a plain text file written in Python):

multresp = {u"favcar1": {u"shortname": u"Car",
                         u"longname": u"Favourite Sports Car", 
                         u"vars": [u"q1a", u"q1b", u"q1c", u"q1d", u"q1e", u"q1f"]},
            u"favpizza": {... etc},
}
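Given such a configuration, the N/% tallying could be sketched as follows. The flat row layout and helper name are assumptions; percentages are computed out of total responses to match the table above:

```python
from collections import Counter

# Sketch of tallying a multiple response set: count every non-empty
# answer across the configured variables. Data layout is an assumption.
def mult_resp_freqs(rows, config):
    counts = Counter()
    for row in rows:
        for var in config[u"vars"]:
            val = row.get(var)
            if val:
                counts[val] += 1
    total = sum(counts.values())
    pcts = dict((val, 100.0 * n / total) for val, n in counts.items())
    return counts, total, pcts

config = {u"shortname": u"Car", u"longname": u"Favourite Sports Car",
          u"vars": [u"q1a", u"q1b", u"q1c"]}
rows = [  # Marco, Frida & Henry
    {u"q1a": u"Ferrari", u"q1b": u"McLaren"},
    {u"q1a": u"Porsche", u"q1b": u"McLaren", u"q1c": u"Aston Martin"},
    {u"q1a": u"Ferrari", u"q1b": u"Lamborghini", u"q1c": u"McLaren"},
]
counts, total, pcts = mult_resp_freqs(rows, config)
print(total, pcts[u"Ferrari"])  # 8 responses; Ferrari on 25.0% of them
```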

It would also be nice to be able to do bar charts of multiple response sets - of either N or %.

e.g. (the real one would be much better looking of course!):

Favourite Sports Car
                                       x
   x                                   x
   x          x            x           x           x 
Ferrari    Porsche    Lamborghini   McLaren   Aston Martin

I think the easiest way to give access to multiresp sets would be in the drop-down lists of variables. Now variables are table-specific - perhaps multiresp sets should be as well. There would be no point having the option of selecting “favcar1 (multiple responses)” if the current table didn't have the relevant variables in it.

Hmmmmm - I wonder if I should leave variable details project-wide or make them table-specific as well. There are pros and cons both ways. There may also be ways of indicating whether a variable detail is project-wide or table-specific (default).

Approach 2

It's more common to face a multiple response set where you don't know in advance what the answers are going to be, and where, besides that, there is a co-dependency with a previous question. This is what I mean:

Keeping up with the cars example:

The first scenario is 'CARS OWNED' with a preliminary list of 65 brands, so you cannot preset them as individual questions. Assuming a maximum of 5 mentions per informant, you'll still have five variables (q01a thru q01e):

But now we can't say that q01a will always have 'FERRARI' as the response; actually, q01a could have any of the 65 preliminary brands, or even new brands, so it's necessary to be able to add a 'vals' mapping as well…

(I don't know any Python, but I think this would be the idea…)

multresp = {u"favcar1": {u"shortname": u"Car",
                         u"longname": u"Cars owned",
                         u"vars": [u"q01a", u"q01b", u"q01c", u"q01d", u"q01e"],
                         u"vals": {1: u"Ferrari", 2: u"Porsche", 3: u"Lamborghini"}},
            u"favpizza": {... etc},
           }

And now, in this case, I can handle all the mentions that I need.

For this scenario, using the same info you provided, I will get exactly the same results:

Three voters: Marco, Frida & Henry

(For all cases, q01d and q01e are empty)

CAR OWNED
Car             N      %
Ferrari         2   25.0
Porsche         1   12.5
Lamborghini     1   12.5
McLaren         3   37.5
Aston Martin    1   12.5
                8   *
* Percentages will add up to greater than 100%
when there are multiple responses

There is also an even more common scenario:

'PREFERRED CAR' - single selection - q02

'REASONS OF PREFERENCE' - multiple selection - q03a thru q03e

Once again, in q03a thru q03e I'm going to have a lot of different answers, and in this case I can't even think about a preset list… so it's a completely open question.

So, you will have

REASONS OF PREFERENCE
Car                N      %
Ferrari           
    Beauty         2      100
    Fashion        1       50
    Speed          2      100
    Style          1       50
    Total          2        *
McLaren
    Performance    1      100
    Speed          1      100
    Tradition      1      100
    Total          1        *
   
* Percentages will add up to greater than 100%
when there are multiple responses

Basically, both scenarios depend on being able to add 'Value Labels' to each question, so there's no dependency between position and value.

With this last scenario the chart will also change from the common one: instead of making a single 'TOTAL REASONS' chart, it should graph the results nested the same way they are nested in the tables, giving two charts for the same question.

Now, talking about the way to give access, I think a drop-down list would be OK, but with multiple selection enabled as well, so if the multresp set you are going to create has 15 variables you can click all 15 without having to open the dialog again. Also, from personal experience, the chance to remove one or more vars from the list at the same time, without reopening the dialog, would help - when adding multiple vars to a list it often happens that you end up picking more than needed :-)

I really think 'variable details' should be table-specific because, even though I always have a variable called AGE(group) in all my projects, the groups per se are not the same in any of them. Even 'GENDER': sometimes it's 1: u'MALE', 2: u'FEMALE' and sometimes it's backwards. So I think there should at least be an option where you can decide between table-specific and global. (For me, global is not an option ;P )

Copy and Paste Into and From SOFA Data Tables

Original user suggestion: Copying and pasting from a spreadsheet into SOFA. For example, it is very convenient to copy data from a spreadsheet you've created in software like Excel because those are very nice for data arrangement and input. Simple as that: copy a column and paste it into SOFA. In this way it is possible to create a table in a few clicks by combining the power of data listing in, for example, Excel with the statistical calculations in SOFA.

Building blocks for implementing copy and pasting:

The following looked interesting:

http://ginstrom.com/scribbles/2008/09/07/getting-the-selected-cells-f...

http://wiki.wxpython.org/wxGrid#Selection_Set_Management

From: http://www.blog.pythonlibrary.org/2010/03/18/wxpython-an-introduction...

And this project: https://sourceforge.net/projects/pywxgridedit/. From the site, it “Enhances wx.grid to provide functionality for cut/ copy/paste and undo/redo. Implemented as mixin.”

Here is the most recent code (about 6 months old): http://pywxgridedit.svn.sourceforge.net/viewvc/pywxgridedit/trunk/pyw...
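Whatever the wxGrid side ends up looking like, the data side is straightforward: spreadsheet selections typically arrive on the clipboard as plain text with tab-delimited cells and newline-delimited rows. A parsing sketch (no wx required; the function name is invented):

```python
def clipboard_to_rows(text):
    """Turn spreadsheet-style clipboard text (tab-delimited cells,
    newline-delimited rows) into a list of row lists, ready to write
    into a SOFA data table."""
    # Spreadsheets usually terminate the selection with a trailing newline.
    lines = text.rstrip("\r\n").split("\n")
    return [line.rstrip("\r").split("\t") for line in lines]

rows = clipboard_to_rows("name\tage\nMarco\t33\nFrida\t29\n")
```

The grid code would then only need to place each parsed cell relative to the current selection.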

Histogram-type Frequency Groupings

When running a frequency table, have option of running frequencies on auto-generated bins.
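A sketch of what auto-generated bins might look like in pure Python. The equal-width rule used here is an assumption, not necessarily the rule SOFA would choose:

```python
def bin_freqs(values, n_bins):
    """Frequencies over n_bins equal-width bins spanning the data range.
    The top edge is included in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(n_bins)
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    labels = ["%g to < %g" % (lo + i * width, lo + (i + 1) * width)
              for i in range(n_bins)]
    return list(zip(labels, counts))

print(bin_freqs([1, 2, 2, 3, 5, 6, 8, 9, 10], 3))
```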

Quicken importer(/exporter)

A quicken/QIF importer could be a nice feature for the future :-)

http://allmybrain.com/2009/02/04/converting-financial-csv-data-to-ofx-or-qif-import-files/

Export to PDF

The best option is probably the one which uses WebKit to render and then turns the (WebKit-guaranteed perfect) result into PDF:

wkhtmltopdf

User suggestion (Marek Paweł Karczewski marek_karczewski@yahoo.com.au): Maybe you could add options “Export charts to separate files” and “Export tables to separate files”, saving each table or chart to a separate file, preferably in PDF format. For instance, I have written a thesis in LaTeX, using LyX, and needed to show a couple of charts. LyX likes the input format of graphics to be PDF.

Also be aware of alternatives:

Gael Varoquaux (core developer for Mayavi2) has developed a very helpful tool called “pyreport”.

pyreport is a program that runs a python script and captures its output, 
compiling it to a pretty report in a pdf or an html file. It can display the 
output embedded in the code that produced it and can process special comments 
(literate comments) according to markup languages (rst or LaTeX) to compile a 
very readable document.

It can also embed pretty plots from captured matplotlib output within the compiled document. Really, quite neat:

pyreport

dompdf

Export to SPSS/SAS

Oracle Connection

Hi Kris,

OK - I'll start with 10g.

Re: testing I don't mean anything fancy - just a simple python script something like this:

import cx_Oracle

...
...connect()
def get_tbls(...):
    ....
    print("Tables: %s" % tbls)
get_tbls()
def get_dbs(...):
    ....
    print("Databases: %s" % dbs)
get_dbs()

At this stage the task is simply to see if I can flesh out the SOFA Statistics database plug-in API.

All the best, Grant
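The same shape of test script can be exercised today against SQLite (stdlib), since the dbe plug-in API needs the same primitives (list tables, list fields) whatever the engine; a cx_Oracle version would swap in queries against Oracle's data dictionary (e.g. user_tables). Helper names here are illustrative only:

```python
import sqlite3

# Stand-in for the Oracle test script: exercise the dbe plug-in API
# shape (list tables, list fields) against stdlib SQLite first.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (id INTEGER, name TEXT)")

def get_tbls(con):
    cur = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")
    return [row[0] for row in cur]

def get_flds(con, tbl):
    cur = con.execute("PRAGMA table_info(%s)" % tbl)
    return [row[1] for row in cur]  # column 1 of table_info is the name

print("Tables: %s" % get_tbls(con))
print("Fields: %s" % get_flds(con, "demo"))
```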

On 27/01/10 04:49, Kris Bravo wrote:

If there are any issues to avoid you could start with 10g express, 10g
cx_Oracle then work out any kinks with switching python libraries.
Other than that, it looks like a straight forward process.

Regarding tests, I haven't looked at your existing test cases much, but
if I find any issues connecting to live data I'd like to give you some
sort of script or test case that you can repeat. Do you have some
thoughts on this? A UI runner, or integration test stack?

Kris

On Jan 24, 8:41 pm, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:

Hi Kris,


Thanks for the feedback. Would the following be a good strategy:

1) I install 10g express edition for testing
2) I install the 11g version of cx_Oracle
3) I start building a module called test_oracle.py which starts fleshing
out the parts behind the SOFA Statistics database plug-in API e.g.
listing available databases, tables, fields etc. You help me test them
(very simple process) as we go.
4) I build a working version of dbe_oracle.py which is automatically
included in the default SOFA Statistics installation packages
5) Testing occurs and then the module is included in the next available
release

Feel free to suggest any changes.

All the best, Grant

Kris Bravo wrote:


Hi Grant,


It is pretty common to connect to databases spanning the different
versions. I checked my local installation and it is the 10g XE install
- You should be pretty safe developing against that. The java drivers
tend to support lower versions so hopefully the python library you
pointed out has the same trait.


Regarding actual installations, the user would only have a single
version installed. They tend to occupy a lot of ports and service
names.


More later when I get a free moment.


Kris Bravo

On Jan 24, 3:29 am, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:



Hi Kris,


Excellent - I will start doing some experimenting. Please forgive me if
I ask a few “dumb” questions along the way about Oracle databases,
typical Oracle database scenarios etc.


Q1: would many users run more than one version of Oracle e.g. 10g and
11g? Is it even possible to do that?
I am hoping not. One option is to let users decide which version of
cx_Oracle to install during the SOFA Statistics installation. I get the
feeling that the different versions of cx_Oracle are not meant to
co-exist but I would love to be proven wrong.


Here is something which suggests you get cx_Oracle for one version only:


Just go to http://www.computronix.com/utilities.shtml and download a
Windows installer or a Linux RPM file matching your versions of Python
and Oracle.
(http://www.ioug.org/python.pdf)


Ideally, the 11g version would handle all versions below it but I
haven't found any documentation to that effect. If it does, I will only
ever have to install one and not bother the user with any decisions. If
not, users will only be able to use the version they have selected at
installation. Unless there is some way of installing multiple cx_Oracle
modules alongside each other and calling them separately. The standard
approach is to:
import cx_Oracle
which doesn't indicate the ability to select version. I will need to do
some experiments on this. And to do that I will need at least one
Oracle database to test on. Which leads to …


Q2: I have found a link to the 10g express edition. But no luck with an
11g version. Is that likely to exist? Would it surprise you as an
Oracle user if there were no 11g express version?
http://www.oracle.com/technology/products/database/xe/index.html


I am probably only going to be able to test against the 10g version
myself and rely on others for feedback on the 9i and 11g versions.


Any feedback/advice etc from Python users with Oracle experience is welcome.


All the best, Grant

Kris Bravo wrote:



Hi Grant,


Our environment is Oracle 11g.


I fetched a copy of the source to catch up on how you have the modules
organised and to help test. We're using a variety of Oracle
deployments here ranging from local XE installs to RACs with failover.
I ought to be able to provide some good feedback.


Kris

On Jan 23, 9:47 pm, Grant Paton-Simpson<gr…@p-s.co.nz> wrote:

Hi Kris,


You're right - I haven't made an Oracle module yet (all available
modules are in the dbe_plugins subfolder). Can I conclude from your
question that you use Oracle yourself? If so, which version(s)? And
would you be able to answer some basic Oracle questions for me
(orientation rather than detailed technical questions). And possibly
help me do some basic testing of a new module? Or would anyone else on
this group be able to or know anyone who could? As a first step I would
try to get some standalone functions working which list tables, list
fields etc. Then I would put it all together in a dbe_oracle.py module
ready to just plug into SOFA Statistics (or possibly, separate
dbe_oracle_9i.py, dbe_oracle_10g.py, and dbe_oracle_11g.py modules if
necessary). Anyway, here are a couple of useful links for future
reference:
http://cx-oracle.sourceforge.net/ - has separate installers by Oracle version.
http://cx-oracle.sourceforge.net/html/index.html


My immediate priority at the moment is to get a Mac installer package
built for SOFA. But I am happy to start the ball rolling on an Oracle
database plug-in if I can get a little bit of advice e.g. which versions
to cover (9i, 10g, 11g).


All the best, Grant

Kris Bravo wrote:



I looked for some indication that Sofa supports connecting to an
Oracle database but couldn't find anything. Is this a feature that's
available/in the works/scoped for the future?


Kris

Table hover

Used jQuery.

Cross-platform Excel

Workbook

  get_sheet_names
  getsheet

Worksheet

  get_data_rows_n
  get_fld_names
  iter

Ubuntu has python-xlrd so make it a dependency in the Deb. Add Windows installer for xlrd. What about Mac?

xlrd Author: John Machin Licence: BSD

  import xlrd
  book = xlrd.open_workbook("myfile.xls")
  print "The number of worksheets is", book.nsheets
  print "Worksheet name(s):", book.sheet_names()
  sh = book.sheet_by_index(0)
  print sh.name, sh.nrows, sh.ncols
  print "Cell D30 is", sh.cell_value(rowx=29, colx=3)
  for rx in range(sh.nrows):
      print sh.row(rx)
  # Refer to docs for more details.
  # Feedback on API is welcomed.
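The Workbook/Worksheet wrapper sketched above could be fleshed out as follows. The in-memory rows here stand in for an xlrd-backed implementation (which would delegate to `open_workbook`, `sheet_names()` etc.); the method names are taken from the notes above:

```python
# Sketch of the cross-platform wrapper. The in-memory backing is for
# illustration only; the real thing would delegate to xlrd.
class Worksheet:
    def __init__(self, name, rows):
        self.name = name
        self._rows = rows  # first row assumed to hold the field names

    def get_fld_names(self):
        return self._rows[0]

    def get_data_rows_n(self):
        return len(self._rows) - 1  # excluding the header row

    def __iter__(self):  # "iter" in the notes above
        return iter(self._rows[1:])

class Workbook:
    def __init__(self, sheets):
        self._sheets = dict((sheet.name, sheet) for sheet in sheets)

    def get_sheet_names(self):
        return list(self._sheets)

    def getsheet(self, name):
        return self._sheets[name]

book = Workbook([Worksheet("Sheet1", [["id", "age"], [1, 33], [2, 29]])])
sheet = book.getsheet("Sheet1")
```

Keeping the interface this small is what would let the same import code run off xlrd on Windows/Mac and python-xlrd on Ubuntu.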

Old Version Plans

Version 1.1

  1. Trendlines option for single line charts
  2. Ability to export individual output and entire reports as PDF (see Export to PDF), and possibly individual chart output as images.
  3. Improved support for copying and pasting output.
  4. Easier to share value labels between different variables.
  5. Support for multiple response tables
  6. Able to explicitly select language preference

Version 1.2

  1. Creating a plug-in system for adding extra stats tests as modules
  2. Random samples
  3. User-defined missing values
  4. Windows installer knowing if packages are installed

Original Design Goals for SOFA Project