Data Transformation: Getting more out of your data – Part 2

Creating new data and deriving data from existing data

Sometimes the data you need does not exist in your database or it’s buried in a format that’s difficult to feed into a visualization utility. This example takes existing data and generates more explicit data. The examples turns three tables into six and then creates a single simple view with many new derived columns.

In this example I’ll use a school that is trying to understand more about student attendance, enrollment, test scores and correlation between these datasets. The data was not initially in an easy to report format. Individual queries could be made but not easily merged into a single dataset for analysis. By thinking about the end reports I wanted to create I decided I needed detailed information about what was happening on each individual day of the school year: valid school dates, which students where enrolled on each date, which students were in attendance on a specific date. I would also then need to do some analysis on that data and classify students into categories based on some of that information.

I started with the existing tables:
*All student absences (StudentID, AbsenceDate).
*Student enrollment (StudentID, StartDate, EndDate) for each year. A student could leave and come back so the table could have multiple entries.
*Student test scores for various assessments for each school year (StudentID, SchoolYear, Score1, Score2, Score3, etc).

One thing that was important was knowing each valid school day for each school year. Initially no table existed with a listing of valid school days. I first created a new table where each row represented a valid date school was in session (SchoolYear, SchoolDate). I was able to populate this table with an SQL statement by selecting unique dates from the student absence table; assuming correctly that at least one student was marked absent for each day of the school year. I manually validated this by cross-referencing my calendar against a printed school year calendar with dates and then also confirmed against unplanned emergency school closures.

I also needed a table that listed individual dates a student was actually enrolled in the school. I used actual programming logic that looked at the student’s enrollment start and end dates and then iterated through that list of days and inserted a row in the new table for valid school days. (StudentID, SchoolDate).

At this point I had a pretty detailed set of tables indicating what was happening on each school day, each school year and each student enrolled at any point during the periods I was examining.Note: Don’t try to make sense of these numbers because they are obfuscated for obvious reasons. Notice that each color of line is data provided as a separate column using the newly created data; this made it easier for Excel to graph. By putting the data into a graph you can visually see the number of students attending who were transient or full-time students. There is a lot of trend information to unpack in this graph but that’s another discussion; I created a separate narrative document with simple steps to understand the presentation and the trends.

Next I needed to classify each student into some categories for each school year. The first category was whether the student should be considered enrolled full-time or only partially enrolled for the year (transient). A student was flagged as transient if they were enrolled less than a specific percentage of the possible school days of the year. Using the recently create table for SchoolYear dates and student enrolled dates I was able to create a simple query to populate a new table with student categorizations for each school year (StudentID, SchoolYear, StudentEnrollmentStatus = “Full-time”, “Transient”).

A column was added to the previous table to categorize the student’s attendance as problematic or not. If a student was considered “High Absence” if they were absent 15 more days and “Low Absence” if they were absent less than 15 days. This information was added to a column in the same table as the EnrollmentStatus column.

At this point I could tell you on any given school day how many students were in attendance, how many absent, how many of those students were considered full-time/transient and how many had attendance issues.

Now we can correlate student’s enrollment status to how they performed on some different assessments.Again, don’t try to make sense of the actual values because they are once again obfuscated.

You can see that scores from two different assessments are displayed (DIBELS-Daze, DIBELS-NWS-CLS) and students are classified into three categories (Transient, Full-Time, Long Term); Long Term meaning the student has been at the school full-time for the past three years.

Bottom line: You know you have the data you need to do the analysis but can’t quite figure out how to feed it into something that makes it easier to view. Think about what you want to know and then work your way back into specific data elements and structures. Pacific Database can help you make your existing data more robust and more obvious to use for analysis.