Exploring saved #Kindle Highlights with #SQL2012 Semantic Search

Kindle Highlights

I have said it before, archiving book highlights is one of the best features available on Kindle. You highlight your books on Kindle or any Kindle app and  then you can review them online whenever you want, and with a little effort… you can (you should actually, who knows what amazon will decide tomorrow…) save them locally for personal reference.

Anyway, I was reading about the new SQL 2012 Semantic Search feature and I thought it could be interesting  to run the new semantic search through my Kindle highlights.

There’s an amazing sample by Roger Doherty on codeplex called MySemanticSearch. I don’t know any other good app to explore (as in freely explore…) my highlights, so this seemed to be a very good start to explore the new SQL 2012 semantic search.

Prerequisites

Be sure to :

Getting my Kindle Highlights

First of all, cranking some PowerShell and using the kindle site, I got my very own highlights for several of my Kindle books :

image

Example:

image

Loading Streams into SQL using SQL 2012 FileTable feature

Using FileStreams and the new SQL 2012 Filetable I quickly loaded these into SQL:

— Create Database
CREATE DATABASE MySemanticSearch
ON PRIMARY (Name = FileTableDB, FILENAME = ‘D:\SQL2012\MSSQL11.SQL2012\MSSQL\DATA\MySemanticSearch.mdf’),
FILEGROUP FTFG CONTAINS FILESTREAM
(NAME = FileTableFS, FILENAME=’D:\SQL2012\MSSQL11.SQL2012\MSSQL\FileStream\MySemanticSearch’)
LOG ON
(Name = FileTableDBLog, FILENAME = ‘D:\SQL2012\MSSQL11.SQL2012\MSSQL\DATA\MySemanticSearch.ldf’)
WITH FILESTREAM (NON_TRANSACTED_ACCESS = FULL,
DIRECTORY_NAME = N’FileTableDB’);
GO

Then the FileTable

CREATE TABLE MyKindleHighlights AS FileTable
WITH
(FileTable_Directory = ‘MyKindleHighlights’);

GO

Loading files into BLOBs using SQL FileTable shared folders

Then just copied all my text files into the newly created folder for the FileTable (created automatically by SQL):

image

image

The table updates automatically, content/streams up and running!

image

Creating the full text/semantic index

You can do this with SQL of course (see here), I choose the lazy way, through Management Studio UI :)

image

image

image

image

image

image

image

Using semantic search

It’s now possible to use semantic queries to explore my highlights:

image

Well…. not so great, I don’t know exactly why, but my text files (I think apostrophes and some non printable chars) caused some confusion in the indexer, I get keywords like “isna”, “whata”… not good.

Reparsing the text files was not an option for now, I opt for adding some more stop words to the already existing in the system stop words list. For that I listed all the words in my semantic index:

 

image

I loaded that list in PowerShell and with a very rude dictionary check script borrowed from the web I listed  the words that were unknown nor anywhere in the files. With that I could now create a SQL Full text custom stop word list based on the already existing system stoplist:

CREATE FULLTEXT STOPLIST [HighLightsStopList]
FROM SYSTEM STOPLIST
AUTHORIZATION [dbo];

Using the PowerShell script to produce something like:

image

Configured the Fulltext index to use that stop list instead:

image

image

Run the query again, and as expected those noise words were now removed from my semantic queries:

image

Loading into MySemanticSearch codeplex sample

Time to load that FileTable into MySemanticSearch! Made a few visual adjustments, and a few others  so that it could work on my dev laptop (some strange problem with joining stream_id and document_key, I used path_locator instead of stream_id).

I opted for building MySemanticSearch from the source code, using VS 2010, so that took me some additional time to set up.

But finally all running! :) Connect to my sql 2012 instance:

image

It scans for available FileTables and finds MyKindleHighlights immediately:

image

Some interesting insights (remind you that these are from my biased highlights only, a very small subset of the complete books):

image

image

 

image

 

Final thougths

There seem to be some shortcomings (seems to handle unigrams only and does treats plurals/singular as completely different keywords, ex cat and cats… ). From a “semantic” search point of view I would expect a little better. May be I’m missing something, as this was a very initial test.

Nevertheless,  as an easily configurable and completely automated unstructured text mining tool now available in SQL Server, it is a very valuable addition to SQL Server.

Reference:

 

Rui

Advertisements

About Rui Quintino
Developer & Software Architect @ DevScope

One Response to Exploring saved #Kindle Highlights with #SQL2012 Semantic Search

  1. Pingback: Browsing & exploring a “big” #kindle library, part II, now with #Excel2013 #PowerView « Rui Quintino Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: