Budget - Resources - Time

This week the POC2-EDITH team have been struggling with managing the magic project troika of resources, priorities and time.

Each one of us has had competing priorities affecting what we can achieve. The will and skill are there. Education.au’s experience with Proof of Concept projects has revealed that time is the most critical factor. We are looking at a two week delay.

With all our competing ongoing priorities, this week we worked hard to map out a plan to advance what we can advance in the background while we tend to the other priorities:

  1. capture the URLs and basic metadata
  2. build the most basic evaluation filters
  3. spider websites based on the basic evaluation results

Later we can do the cool stuff: proper TF.IDF analysis and evaluation etc.

Learning lessons at each milestone

This post on learning lessons is a week late. Learning the lessons at the end of each milestone rather than at the end of the project is a key POC characteristic.

The effort put into the Show and Tell 1 paid off. Everyone was engaged with the topic and there were even a few laughs. They asked good questions, raising some issues we hadn’t considered before.

We reflected upon:

  • what we did,
  • how we did it,
  • the issues people raised,
  • what we did well and
  • what we could have done better.

The Show and Tell

  • Availability of people on a Friday. Should we test on a different day? Look at Thursday. We are looking for an opportunity for better feedback – giving people time to think about the presentation, but not too long, like the weekend.
  • Use of a major image to illustrate the most important concept worked.
  • The Show and Tell clearly demonstrated what the IOs really do.

Show and Tell Issues raised

Copyright

  • what is specifically copyright: a bookmark? a tag? a collection of bookmarks?
  • who owns the copyright? del.icio.us or the bookmarker or no one?

Communicating with the users of del.icio.us

  • should we tell them, even though del.icio.us has public bookmarks and an open API?
  • would identifying key users open the system up for abuse and distort the results?

Does this threaten the work that the employed IOs do here?

  • [Response:no - EDITH is designed to provide the IOs with better tools which better utilise their expertise and really add value to the discovered resources.]
  • Invite IOs to attend POC2 meeting for their view of the EDITH project.

Vision

A point was raised whether ‘Users’ is the right word in the vision. ‘Community’ was suggested as an alternative.

Are we really dealing with ‘user engagement’? The approach is not engaging with users, and the foreseeable development path has only limited (one-way) engagement. Revisit next week.

Note to readers (you): we really would welcome your thoughts on this issue.

Process - worked well

Having POC1 rolled into a real MyEdna project has given credibility to the POC process and some momentum. It is not as difficult to ‘sell’ the idea that someone is doing work on the proof of concept.

  • Using Yahoo! Pipes as a demonstrator to validate general principles was very very instructive for prototyping EDITH.
  • Being flexible paid off: providing the extra development week was very successful.
  • Collection Mgt is processing time intensive – this directly impacts on development time.

What we did

The real results impressed people. The raw list of users and links provided many interesting and useful results. Nick’s analysis of the results was also revealing, and people at the presentation recognised some of them.

What we could do for next time

  • Look at dedicated POC kit for processing - maybe an old computer.

Deliberately getting it wrong

I have become a fan of getting it wrong. Deliberately.

Ultimately it saves me a lot of time.

To explain: currently we are modelling how the education.au intranet could be structured. All my preparations for this (content management, portal home page, security model mock-ups) have been put up with the proviso that it’s probably totally wrong (requires only a pinch of courage).

What I find out quickly:

  1. People actually state their opinion on what’s been presented
  2. What’s wrong with it
  3. What people want
  4. What the big picture really should be

Informed by this, I can then model what’s really needed and then go do it.

By simply spending a little time in turning an idea into a picture - it gives other people something to kick against.

A slight twist on this is simply having a go and seeing if it works or not. Proofs of Concept are a case in point (without the deliberately getting it wrong bit). They are geared to see if we can create something innovative and useful. The best way to see if we have got it either right or wrong is to visualise it with a rapid prototype.

Makes me think of the Sir Ken Robinson TED Talk nativity story (at 4m30s) where kids are not afraid of being wrong - but have a go anyway.  Through this, they learn and grow.
It’s very useful - give it a go.

The proof of the pudding is in the eating as they say.

Friday night live with EDITH 0.1

The following was the Show and Tell 1 to the staff at education.au.

Edna Proof of Concept 2 Show and Tell 1

Pop Quiz! ..and you thought you didn’t have to do any work!

Q: How many current records are held in the edna collection (that is not including the distributed collections like ABC Online, but core edna)?
A: 36704

Q: What did the Dam Vam report estimate as the annual $ value to all educators of their use of the edna collection search?
A:
A) $1 Million
B) $2 Million
C) $4 Million <= correct

There has been a huge amount of work invested over the years just in creating the records, let alone creating and maintaining the edna collection.

When asked to innovate collection management we first had to understand 'what's hard about collection management'?

The answer lies in building the collection. Scouring the internet, evaluating each resource and describing the results. It's a long, knowledge intensive process.

So what's this Proof of Concept vision?

“Collection Improvement through user engagement and better metadata tools”

Breaking this down we are seeking to

  • Improve the quantity and quality of the edna collection
  • Maintain the integrity of the collection
  • Automate the drudge work
  • Engage with educators where they are online such as social bookmarking sites.

What I’d like to show now is how the way an Information Officer maintains the edna collection will change for the better through the Collection Management POC.

First let’s look at the way our IO, Jade, currently builds resources.

IOs Job: adding a resource

Jade’s job is to discover, evaluate and describe online resources for inclusion into the edna collection. Sounds simple doesn’t it?

DISCOVERY

Let’s have a look at Discovery.

Like other IOs, Jade has her own list of preferred resource sources as well as aggregated resources.

These include
1) Education Websites
2) Online Journals
3) Online Media outlets
4) RSS news aggregators
5) Events networks
6) Email ListServes
7) Groups
8) Conference alerts
9) Google alerts
10) Government websites and news services

There are many, many more and as this slide indicates – Jade has to:
1) Remember to check each source
2) Take the time to visit them all.

She has:

3) Reduced time to add value to the collection
4) Lack of certainty about the resources’ relevance to educators

  • Wouldn’t it be better if there were fewer places for Jade to check?
  • Wouldn’t it be better if Jade knew what educators really want to keep?
  • How can we save Jade the IO’s time for more valuable activities?

You may have guessed why we have called her Jade – it’s not because she is jaded with her job – she loves it! But she is jaded with the repetitive nature of some of the work and frustrated by not getting to the really valuable part of her work because of the endless scouring of the same places on the internet and the adding of the same metadata over and over again. These are clues to possible points of automation and user contribution.

EVALUATION

Jade now has her list of URLs for potential inclusion in the collection – now she needs to evaluate. Evaluation, just like programming, is knowledge intensive.
Jade uses the edna collection policy to discriminate between the useful and the non-useful, the promotional and the edifying.

There are three collection management policies:

  • edna Governance Policy
  • edna Collection Policy
  • edna Collection Policy Schedules (sector specific)

These cover issues of

  • Accessibility
  • Authority – from reputable sites
  • Reliability – is the website going to be there tomorrow?
  • Uniqueness – primary source and value
  • Objectivity – impartiality and balance of views
  • Ethics and legality – respectful of law and decency

The second phase of our project will be to look at employing our user engagement and metadata tools to improve the evaluation stage. Show and Tell 2 will cover this in detail; it shows evaluation as a knowledge intensive process to maintain the integrity of the edna collection.

DESCRIBE

With the evaluated resources, Jade can now describe the online resources.

DSPACE is the workshop of the edna collection.

Jade passes through DSPACE’s 6 phases, describing each resource – a lengthy process incorporating further checks, many metadata schemas, and interpretation of edna policy.

With hard work, persistence and the application of her knowledge and expertise, Jade will have grown the edna collection by at least 1 record. This is how Jade and the other IOs have grown the edna collection to 36,000 records.

INNOVATION

So how are we innovating what’s hard about collection management? By engaging with educators where they ‘nest’ online eg the social bookmarking sites like Del.icio.us; by automating the drudge work.

EDITH

We do currently provide for users to suggest sites to edna. We ask them to submit and we receive an email. This is a very web 1.0 approach to things and it also results in very few actual edna entries. With our proposed model we will go out and find what the users think is useful.

How will we do this? With EDITH. We are innovating Jade’s job with the EDITH engine: Edna Discovery Information Trapper and Hoarder engine. We’ll automate as much as possible and provide the IO with as much information as possible to assist their decision making process.

The EDITH engine is being designed to tap into social networking information repositories like Del.icio.us, Scuttle, Digg and Wikipedia. We are starting with del.icio.us as a proof of concept, but the model could be adapted to include other sources.

Quick Quiz

Q: Approximately how many internet users are there in the world?
A:
A) 700 million
B) 1.1 billion <-- correct
C) 1.9 billion

Q: What's NetCraft's best guess for the number of active websites in the world?
A:
A) 50 million
B) 100 million
C) 122 million <-- correct

And worth mentioning that’s only sites – Google searches over 8 billion web pages – and that’s not the whole web.

Q: Approximately how many del.icio.us users are there in the world?
A:
A) 220 thousand
B) 370 thousand
C) 2 million <-- correct

Q: How many Information Officers are there in education.au?
A: 11

1.1 billion people using 122 million websites makes discovering educationally valuable online resources daunting. It’s impossible for 11 ed.au IOs to cover everything.

The POC team quickly realised that in fact there are ‘Accidental Information Officers’ out there in the Social bookmarking sites. How do we find them?

With 2 million social bookmarkers describing the web, the challenge for us was to separate the wisdom of the crowd from the madness of the masses.

The team set out to find the:

  • Authoritative Users and
  • Authoritative Tags

The solution was to find del.icio.us users who have tags which match the edna collection’s category keywords. For example we discovered a user called ‘Mark Booker’. He loves to bookmark!

Another quantitative check employed is the number of times a resource has been bookmarked.

We further checked Mark Booker’s educational bookmarking credentials by checking if some of his bookmarks already exist in the edna collection. Passing this test Mark Booker has become an Accidental Information Officer.
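The checks described above can be sketched roughly as follows. This is only an illustration, not EDITH's actual code: the `Bookmarker` shape, the function names and the thresholds are all assumptions made for the example, and a real run would read users and tags from del.icio.us rather than from hand-built objects.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmarker:
    """A social bookmarking user (hypothetical shape, for illustration only)."""
    name: str
    # Maps each bookmarked URL to the set of tags the user gave it.
    bookmarks: dict = field(default_factory=dict)


def is_accidental_io(user, category_keywords, edna_urls,
                     min_matching_tags=2, min_known_urls=1):
    """Apply the two credential checks to one user.

    1. Enough of the user's tags match edna category keywords.
    2. Some of their bookmarks already exist in the edna collection.
    Both thresholds are illustrative assumptions, not project values.
    """
    user_tags = set()
    for tags in user.bookmarks.values():
        user_tags |= {t.lower() for t in tags}
    matching_tags = user_tags & {k.lower() for k in category_keywords}
    known_urls = set(user.bookmarks) & set(edna_urls)
    return (len(matching_tags) >= min_matching_tags
            and len(known_urls) >= min_known_urls)


def candidate_urls(users, category_keywords, edna_urls):
    """Count how often each URL not yet in edna is bookmarked by qualifying users."""
    counts = {}
    for user in users:
        if not is_accidental_io(user, category_keywords, edna_urls):
            continue
        for url in user.bookmarks:
            if url not in edna_urls:
                counts[url] = counts.get(url, 0) + 1
    # Most-bookmarked candidates first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The bookmark count here plays the role of the quantitative check: a URL saved by several qualifying users rises to the top of the candidate list.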

In fact there are 8,400 other ‘Mark Bookers’ out there. Nick is going to show you how we capture what they are bookmarking. Compare this figure of 8,400 with the 140 ed.au DSPACE users who have helped create the collection over 10 years.

This group of Accidental Information Officers gives us a candidate list of bookmarks for evaluating and describing. The resources coming to us are not fully evaluated, but they are healthy candidates, coming from promising users and promising tags.

What have we actually achieved here? In fact we have answered our earlier questions:

Wouldn’t it be better if Jade didn’t have to check so many places?

Yes – EDITH will collate many sources into a single place.

Wouldn’t it be better if Jade knew what educators really want to keep?

Yes – EDITH discovers what the Mark Bookers believe is important to them.

How can we save Jade the IO’s time for more valuable activities?

  • Having greater coverage of the internet by more people
  • Reducing search time
  • Giving her more time to look in depth at complex resources

Is this all theory?

No. Results are real – new edna collection resources have already been added. Nick will tell you about the approach and results in detail.

What problems are there?

There are some questions to consider:

It is dependent on what people bookmark. PDFs are harder to bookmark for example.

Quantity is not an indicator of quality.

We need to think about potential copyright issues with mining people’s bookmarks.

The next Show and Tell will address issues of evaluation of quality; helping to create a shortlist from this mass of candidate bookmarks.

——-

Images of people for “Discovery: the internet” sourced from http://iconka.com and MS Office Clip art.

Whiplashed by The Long Tail effect

Mike’s messing with my mind again. Here I was, innocently prep’ing the POC SnT and he waltzes in and drops another concept like a smart bomb for the collection management proof of concept.

While we’re trying to find authoritative social bookmarking users, we’ll be finding those bookmarkers whose sheer quantity of bookmarks can create a false signal.

As Mike points out, what would be cool is if we could find the obscure bookmarkers with resources of extreme value that the masses have overlooked - and take advantage of the ‘Long Tail Effect’.

Some useful reading on long tail effect…

Both you and I have a specific educational focus (call this ‘signal’) and a few differing interests (call these ‘noise’).

So does a third person; however they might have lots of other interests (call this ‘very noisy’).

Our signal to noise ratio would be very similar - unlike that with the third person. The sheer noise they generate in the current web world would make them ‘popular’ and rise in the rankings.

Doesn’t it make sense that you and I should collaborate together more than with the third person?

We have more in common as evidenced by our bookmarks, flickr photos, online subscriptions etc.

I could use this principle by seeking out those who are living in the Long Tail online (obscure) but are highly selective and therefore highly relevant to what my focus is.

What is signal to me is noise to someone else.

This alternative qualitative approach could help capture those people we’ll miss through our quantity based POC metrics.
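One way to make the signal-to-noise idea concrete is to score each bookmarker by the share of their tag uses that fall inside our focus, rather than by raw volume. The sketch below is an assumption about how that might be measured, not something the POC has built; the function names and the `min_signal` cut-off are invented for the example.

```python
def signal_ratio(user_tags, focus_tags):
    """Fraction of a user's tag uses that fall inside our focus (the 'signal')."""
    if not user_tags:
        return 0.0
    focus = {t.lower() for t in focus_tags}
    signal = sum(1 for t in user_tags if t.lower() in focus)
    return signal / len(user_tags)


def rank_by_selectivity(tags_by_user, focus_tags, min_signal=1):
    """Rank users by signal ratio, most selective first.

    tags_by_user maps a user name to the list of tags they have used,
    one entry per use. Users with fewer than min_signal focus tags are
    dropped so that an empty or unrelated account cannot rank at all.
    """
    focus = {t.lower() for t in focus_tags}
    ranked = []
    for user, tags in tags_by_user.items():
        hits = sum(1 for t in tags if t.lower() in focus)
        if hits >= min_signal:
            ranked.append((user, signal_ratio(tags, focus_tags)))
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

With this measure a quiet user with two focus tags out of three outranks a prolific user with five focus tags out of fifty, which is the opposite of what a pure quantity metric would say.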

Discovery implies recognition of value. In the internet world this is likely to create new ecosystems of interest in old or obscure ideas to feed further creativity, culture and/or innovation. For example: who really invented the LED?

Another ripper idea for the POC carpark. Thanks Mike for the great discussion.

The elves and the shoe maker

Code is being cut. Presentation is being prepared. Speeches are being spun. We have EDITH prepared for display next week.

The mining of http://del.icio.us/ is showing very interesting results, with 1000s of suggested sites based on the authoritative tags and users. Importantly, the early indications of quality mean the underlying assumptions are working well.

This is an immediate outcome for the information officers here. [Nick - is there a public feed we can publish here?]

We’re in a bit of a quandary whether to announce the authoritative tags and users because there is potential to have these authoritative sources distorted / abused.

****

Vaughan & Nick will have a chat to a PhD student interested in machine learning and text mining for Show and Tell 2.

****

On another topic, Sarah’s going to lead the Show and Tell emphasising the contrast between now and EDITH’s potential.

One interesting point she raised today is that EDITH builds upon the collaborative nature of the internet, and ultimately it will feed back as the resources are eventually described in detail.

****

Part of the evaluation phase is checking online blacklist services. We’ll see if that makes it in for SnT-1.

****

The EDITH logo? It’s a Satin Bowerbird (Ptilonorhynchus violaceus) which likes all things blue like these glasses.

Office tips to show off

Have a presentation to give tonight on very simple MS Office tips and tricks. Thought others might find this useful.

The example is simple and can be achieved in lots of other ways - however it’s the thinking that’s important. You can reformat lots of data quickly with creative uses of MS Office tools.

Feel free to distribute these files with no restrictions / conditions.

I’m going to see how these tips translate to OpenOffice at home tonight.

Presentation files: Office Tips To Show Off

Preview the presentation [links to SlideShare.net]

What’s cool about evaluating?

Sarah has got us all thinking about evaluation while we prepare for Show and Tell 1. Hopefully this strategy will help minimise the project slippage.

The evaluation discussion stepped through filtering the candidate resources discovered for SnT-1. The list was fairly straightforward:

  • Check edna for duplicates
  • Check SPAM
  • Check blacklists
  • Check Scope criteria: language, geographic info.
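The checklist above could be run as a simple chain of filters, sketched below. Everything named here is an assumption for illustration: the candidate dictionaries, the word-list spam test and the host blacklist stand in for whatever edna lookups and online blacklist services the project ends up using, and scope is reduced to language only (geographic checks are left out).

```python
from urllib.parse import urlparse


def evaluate_candidates(candidates, edna_urls, blacklisted_hosts,
                        spam_words, allowed_languages):
    """Run each candidate through the four checks, in order.

    candidates is a list of dicts with 'url', 'title' and 'language'
    keys (a hypothetical shape). Returns (kept, rejected), where
    rejected is a list of (url, reason) pairs.
    """
    kept, rejected = [], []
    for item in candidates:
        url = item["url"]
        host = urlparse(url).hostname or ""
        title_words = set(item.get("title", "").lower().split())
        if url in edna_urls:
            rejected.append((url, "duplicate"))
        elif title_words & spam_words:
            rejected.append((url, "spam"))
        elif host in blacklisted_hosts:
            rejected.append((url, "blacklisted"))
        elif item.get("language") not in allowed_languages:
            rejected.append((url, "out of scope"))
        else:
            kept.append(item)
    return kept, rejected
```

Recording the reason for each rejection keeps the filter auditable, so an IO can see why a bookmark never reached the candidate list.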

All very ho hum, not very interesting. The most interesting point was the idea - ‘if someone has bookmarked it then it’s relevant - so we can ignore the date criteria.’

What really got us charged up was Vaughan’s idea to discover which metadata (type and value) is significant as a predictor for evaluation and/or classification of an online resource identified from the previous Discovery phase. To quote Vaughan:

“Latent Semantic Analysis (Berry et al) is an example of a technique that maps a document collection to a lower dimensional space than that traditionally used (where each index term effectively forms another dimension in the search space, as used for example in traditional TF.IDF [term frequency–inverse document frequency] similarity measures).”

Getting Vaughan to explain it to me in plainer English:

Documents are usually represented by the terms that occur within them but some of those terms are more significant than others in representing the content. We want to identify those high-information-value terms and throw away the chaff.

So we’re actually identifying both wheat and chaff by valuing the terms used?

Yeah - Lucene, etc do it using a TF.IDF metric.

This is a weighting of terms?

Yes. We’re evaluating both the metadata term and its value. The addition of field types adds information, but ultimately we still have a bunch of features, and some of them are not going to be significant in separating relevant from non-relevant docs.
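To make the weighting concrete, here is a minimal sketch of the textbook TF.IDF calculation. It is not Lucene's actual scoring formula, which adds its own normalisations; the function name and the pre-tokenised input are assumptions for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight every term in every document.

    docs is a list of token lists. Returns one {term: weight} dict per
    document, where weight = (share of the doc's tokens that are this
    term) * log(number of docs / number of docs containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights
```

A term that appears in every document gets a weight of zero (the chaff), while a term that appears in only one document gets the highest weight there (the wheat).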

How does that relate to using edna as a training set?

We can use existing classifications to generate some more detailed ‘signatures’ for representing docs in a given class/category/sector. Thus we can build a predictor/evaluator for a given training set (sector/category). We can do this at various levels of granularity, from the collection level to the smallest category, so we can train up a series of predictors.
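A very rough sketch of that idea: build one term ‘signature’ per category from records that are already classified, then assign a new document to the category whose signature it most resembles. This uses raw term counts and cosine similarity as a stand-in, which is an assumption on my part and much simpler than the Latent Semantic Analysis Vaughan describes; the function names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_signatures(labelled_docs):
    """Sum term counts per category.

    labelled_docs is an iterable of (category, tokens) pairs taken from
    records that already carry a classification.
    """
    signatures = defaultdict(Counter)
    for category, tokens in labelled_docs:
        signatures[category].update(tokens)
    return dict(signatures)


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def predict(signatures, tokens):
    """Return the category whose signature is closest to the new document."""
    doc = Counter(tokens)
    return max(signatures, key=lambda cat: cosine(signatures[cat], doc))
```

Training one set of signatures per sector, and another per category within a sector, would give the series of predictors at different levels of granularity.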

As ed.au’s resident search expert, Vaughan mentions there are lots of further possibilities with this approach, for instance:

  • Education Sector prediction
  • Meta data suggestion
  • lots more…

We all agreed that this Meta Discovery thing sounded like the cool thing to do for evaluation.

Ambitious? Nah - just a walk in the POC.

Telling shows and tells

Development time is tight - the SnT-1 delay needs to be two weeks.

Nick pointed out that we don’t need to delay thinking about SnT-2 and in fact we can start on scoping SnT-2 (next post).

But first we need to sort out the SnT-1. I prepared a running sheet for it; & crikey it is boring.

We really want to engage our audience (initially the staff here at ed.au). The MyEdna POC Show and Tell Q&A sessions had tumbleweed passing us by. How to improve it? Jokes? An IO fable? Show the shiny stuff quick?

We decided that we are going to prepare relevant quiz questions to prod the audience with and show how such a seemingly simple process of resource discovery is not so simple. Hopefully this way we’ll avoid people nodding off on a Friday afternoon.

Sarah & I reflected on how useful the boring running sheet was as a discussion starter - it helped us move from what we don’t want to what we really want.

I figure this is an important way proof of concepts can work if inspiration fails at first.

A POC’s unintended consequences

A serendipitous outcome from today’s meeting was some tips to make our jobs just that little bit easier - totally separate from the POC.

With this POC, we’re wrestling in detail with understanding the way an Information Officer works to provide resources for educators.

Our discussion prompted Nick to suggest using Google alerts to monitor websites without RSS or email newsletters. This is immediately useful to all IOs here.

This sharing of knowledge and understanding of each other’s jobs are great unintended consequences of the Proof of Concept.