Twitter Analysis DB Details

I have tried to make the above accurate, but read the code.
 
TweetTableWriter and ConcordTableWriter are helper classes that do the actual table writing.
 
  
== Concord Words ==
  
The simplest way to get a word list is to split the text on whitespace.  Here things are a bit more complicated, partly because the words are "normalized".  A simple example is making all the words lower case, but there is a fuller story:
 
 
* Lower case everything.
* Some odd non-ASCII characters in the tweets are first converted to more normal characters or to white space.
* Classify the words.  English words do not start with numerals, but "tweet words" do, so we classify them.  We also have words starting with @, with # for hashtags, and with http for URLs.
* Linguists have a way of normalizing words to what they call lemmas.  I do this using a library called spacy.  Among other things it converts plurals to singulars.
* Note that many words in the concordance are in some sense not words at all (even excluding hashtags).  When we try to match them up to a table of word usage (words), these match-ups fail.
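
The steps above can be sketched roughly as follows.  This is only an illustration, not the project's code: the function names (clean_text, classify_word, concord_words), the category labels and the replacement table are all made up for the example.  The lemmatizer is passed in as a plain function so the sketch runs without spacy; with spacy and an English model installed, something like <code>lambda w: nlp(w)[0].lemma_</code> (after <code>nlp = spacy.load("en_core_web_sm")</code>) could be passed instead.

```python
import re

# Hypothetical replacement table: a few typographic characters that
# commonly show up in tweets, mapped to plain ASCII or white space.
_REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # dashes
    "\u2026": " ",                  # ellipsis
    "\u00a0": " ",                  # non-breaking space
}


def clean_text(text):
    """Convert odd non-ASCII characters to plain ones or to white space."""
    for odd, plain in _REPLACEMENTS.items():
        text = text.replace(odd, plain)
    # Anything still outside ASCII becomes white space.
    return re.sub(r"[^\x00-\x7f]", " ", text)


def classify_word(word):
    """Return an (assumed) category label for a single token."""
    if word.startswith("@"):
        return "mention"
    if word.startswith("#"):
        return "hashtag"
    if word.startswith("http"):
        return "url"
    if word[:1].isdigit():
        return "numeric"
    return "word"


def concord_words(tweet, lemmatize=lambda w: w):
    """Clean, lower-case, split and classify a tweet.

    Only tokens classified as ordinary words are passed to `lemmatize`;
    the default leaves them unchanged.
    """
    result = []
    for raw in clean_text(tweet).lower().split():
        kind = classify_word(raw)
        word = lemmatize(raw) if kind == "word" else raw
        result.append((word, kind))
    return result


if __name__ == "__main__":
    for word, kind in concord_words("Hello @Bob #Python 2day http://t.co/x"):
        print(word, kind)
```

Only the ordinary words are lemmatized here, on the assumption that hashtags, mentions and URLs should be kept as written; the real code may choose differently.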
 
  
 
=== Words ===
 
