Exploring saved #Kindle Highlights with #SQL2012 Semantic Search
October 16, 2012 1 Comment
I have said it before, archiving book highlights is one of the best features available on Kindle. You highlight your books on Kindle or any Kindle app and then you can review them online whenever you want, and with a little effort… you can (you should actually, who knows what amazon will decide tomorrow…) save them locally for personal reference.
Anyway, I was reading about the new SQL 2012 Semantic Search feature and I thought it could be interesting to run the new semantic search through my Kindle highlights.
There’s an amazing sample by Roger Doherty on codeplex called MySemanticSearch. I don’t know any other good app to explore (as in freely explore…) my highlights, so this seemed to be a very good start to explore the new SQL 2012 semantic search.
Be sure to :
- (check MySemanticSearch prerequisites page)
- IIS 7
- .NET 4.0
- SQL 2012
- Install SQL 2012 with the FileStream & Semantic Search feature
- Attach & configure the Semantics Db
Getting my Kindle Highlights
First of all, cranking some PowerShell and using the kindle site, I got my very own highlights for several of my Kindle books :
Loading Streams into SQL using SQL 2012 FileTable feature
Using FileStreams and the new SQL 2012 Filetable I quickly loaded these into SQL:
— Create Database
CREATE DATABASE MySemanticSearch
ON PRIMARY (Name = FileTableDB, FILENAME = ‘D:\SQL2012\MSSQL11.SQL2012\MSSQL\DATA\MySemanticSearch.mdf’),
FILEGROUP FTFG CONTAINS FILESTREAM
(NAME = FileTableFS, FILENAME=’D:\SQL2012\MSSQL11.SQL2012\MSSQL\FileStream\MySemanticSearch’)
(Name = FileTableDBLog, FILENAME = ‘D:\SQL2012\MSSQL11.SQL2012\MSSQL\DATA\MySemanticSearch.ldf’)
WITH FILESTREAM (NON_TRANSACTED_ACCESS = FULL,
DIRECTORY_NAME = N’FileTableDB’);
Then the FileTable
CREATE TABLE MyKindleHighlights AS FileTable
(FileTable_Directory = ‘MyKindleHighlights’);
Loading files into BLOBs using SQL FileTable shared folders
Then just copied all my text files into the newly created folder for the FileTable (created automatically by SQL):
The table updates automatically, content/streams up and running!
Creating the full text/semantic index
You can do this with SQL of course (see here), I choose the lazy way, through Management Studio UI :)
Using semantic search
It’s now possible to use semantic queries to explore my highlights:
Well…. not so great, I don’t know exactly why, but my text files (I think apostrophes and some non printable chars) caused some confusion in the indexer, I get keywords like “isna”, “whata”… not good.
Reparsing the text files was not an option for now, I opt for adding some more stop words to the already existing in the system stop words list. For that I listed all the words in my semantic index:
I loaded that list in PowerShell and with a very rude dictionary check script borrowed from the web I listed the words that were unknown nor anywhere in the files. With that I could now create a SQL Full text custom stop word list based on the already existing system stoplist:
CREATE FULLTEXT STOPLIST [HighLightsStopList]
FROM SYSTEM STOPLIST
Using the PowerShell script to produce something like:
Configured the Fulltext index to use that stop list instead:
Run the query again, and as expected those noise words were now removed from my semantic queries:
Loading into MySemanticSearch codeplex sample
Time to load that FileTable into MySemanticSearch! Made a few visual adjustments, and a few others so that it could work on my dev laptop (some strange problem with joining stream_id and document_key, I used path_locator instead of stream_id).
I opted for building MySemanticSearch from the source code, using VS 2010, so that took me some additional time to set up.
But finally all running! :) Connect to my sql 2012 instance:
It scans for available FileTables and finds MyKindleHighlights immediately:
Some interesting insights (remind you that these are from my biased highlights only, a very small subset of the complete books):
There seem to be some shortcomings (seems to handle unigrams only and does treats plurals/singular as completely different keywords, ex cat and cats… ). From a “semantic” search point of view I would expect a little better. May be I’m missing something, as this was a very initial test.
Nevertheless, as an easily configurable and completely automated unstructured text mining tool now available in SQL Server, it is a very valuable addition to SQL Server.
- Store and Index Documents in SQL Server 2012 – An End-to-End Walkthrough
- FileTables (SQL Server)
- Exploring Semantic Search Key Term Relevance
- SQL Server Unstructured Data Analysis (Semantic Search)
- Preparing for Sematic Search
- Step-by-Step: Enabling Semantic Search on SQL Server 2012
- Creating a Full-Text Index on a FileTable in SQL Server 2012
- How stopwords affect Property Search in SQL Server 2012 Full-Text Search