Welcome to the TextVisDRG documentation!¶
This project is a prototype exploratory visual data analysis tool designed for social scientists working with large social message data sets.
Contents:
Apps¶
Base¶
Base is the core app that serves the main pages of the visualization application. It also gathers together some miscellaneous utilities and shared classes that are used by other apps.
Views¶
Template Tags¶
Checks the current request to see if it matches a pattern. If so, it returns ‘active’.
To use, add this to your Django template:
{% load tags %}
<li class="{% active request home %}"><a href="/">Home</a></li>
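The tag's implementation is not shown on this page. As a rough sketch of the matching logic only, assuming the pattern is a regular expression tested against the request path (the function name `active_class` is hypothetical, not part of the project):

```python
import re


def active_class(path, pattern):
    """Return 'active' when the request path matches the pattern, else ''."""
    # Assumption: the pattern is a regular expression searched against the path.
    return 'active' if re.search(pattern, path) else ''


home_result = active_class('/', r'^/$')
about_result = active_class('/about/', r'^/$')
print(repr(home_result))
print(repr(about_result))
```

Returning an empty string for non-matching pages keeps the class attribute harmless in the rendered HTML.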
Models¶
-
class
msgvis.apps.base.models.
MappedValuesQuerySet
(*args, **kwargs)[source]¶ A special ValuesQuerySet that can re-map the dictionary keys while they are being iterated over.
valuesQuerySet = queryset.values('some__ugly__field__expression')
mapped = MappedValuesQuerySet.create_from(valuesQuerySet, {
    'some__ugly__field__expression': 'nice_expression'
})
mapped[0]  # { 'nice_expression': 5 }
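The key re-mapping idea can be illustrated without Django. The sketch below is not the class's implementation; it uses plain dictionaries and a hypothetical helper named `remap_keys` to show what happens to each row during iteration:

```python
def remap_keys(rows, field_map):
    """Yield each row dict with its keys renamed according to field_map."""
    for row in rows:
        # Keys that are missing from field_map pass through unchanged.
        yield {field_map.get(key, key): value for key, value in row.items()}


rows = [{'some__ugly__field__expression': 5, 'id': 1}]
mapped = list(remap_keys(rows, {'some__ugly__field__expression': 'nice_expression'}))
print(mapped[0])
```

Renaming lazily, one row at a time, mirrors the "while they are being iterated over" behavior described above.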
Importer¶
The Importer app is concerned with getting corpus data into the database. It defines a number of Django management commands for making this easier.
Commands¶
-
class
msgvis.apps.importer.management.commands.import_corpus.
Command
[source]¶ Import a corpus of message data into the database.
$ python manage.py import_corpus <file_path>
-
class
msgvis.apps.importer.management.commands.import_twitter_languages.
Command
[source]¶ Import supported languages from the Twitter API into the database. If the languages already exist in the database, they will not be duplicated.
Note
Requires the tweepy Twitter API library:
pip install tweepy
Example:
$ python manage.py import_twitter_languages
-
class
msgvis.apps.importer.management.commands.import_twitter_timezones.
Command
[source]¶ Obtains a mapping of the Twitter-supported timezones from the Ruby on Rails TimeZone class.
Get the mapping dictionary from https://github.com/rails/rails/blob/master/activesupport/lib/active_support/values/time_zone.rb
Note
Requires Ruby on Rails to be installed:
gem install rails
Example:
$ python manage.py import_twitter_timezones setup/time_zone_mapping.rb
Twitter Integration¶
Utilities for working with Twitter.
Models¶
-
msgvis.apps.importer.models.
create_an_instance_from_json
(json_str, dataset_obj)[source]¶ Given a dataset object, imports a tweet from json string into the dataset.
Enhance¶
The Enhance app is supposed to integrate with external tools to add metadata to the raw corpus data.
For example, it might use an external sentiment analysis tool to label messages for sentiment.
Dimensions¶
The Dimensions app provides functionality for asking about dimension metadata, including distributions within a dimension over a dataset.
Registry¶
Import this module to get access to dimension instances.
from msgvis.apps.dimensions import registry
time = registry.get_dimension('time') # returns a TimeDimension
time.get_distribution(a_dataset)
-
msgvis.apps.dimensions.registry.
get_dimension
(dimension_key)[source]¶ Get a specific dimension by key
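A key-based registry of this kind can be sketched in a few lines. This is an illustration under assumptions, not the module's code: the `Dimension` class, the `register` helper, and the private `_registry` dictionary are all hypothetical stand-ins.

```python
class Dimension(object):
    """Hypothetical stand-in for a dimension object with a key and a name."""

    def __init__(self, key, name=None):
        self.key = key
        self.name = name or key


# Module-level mapping from dimension key to dimension instance.
_registry = {}


def register(dimension):
    """Store a dimension under its key and return it."""
    _registry[dimension.key] = dimension
    return dimension


def get_dimension(dimension_key):
    """Look up a registered dimension; raises KeyError for unknown keys."""
    return _registry[dimension_key]


register(Dimension('time', name='Time'))
print(get_dimension('time').name)
```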
Models¶
-
msgvis.apps.dimensions.models.
find_messages
(queryset)[source]¶ If the given queryset is actually a
Dataset
model, get its messages queryset.
-
class
msgvis.apps.dimensions.models.
CategoricalDimension
(key, name=None, description=None, field_name=None, domain=None)[source]¶ A basic categorical dimension class.
Attributes:
key (str): A string id for the dimension (e.g. ‘time’)
name (str): A nicely-formatted name for the dimension (e.g. ‘Number of Tweets’)
description (str): A longer explanation for the dimension (e.g. “The total number of tweets produced by this author.”)
field_name (str): The name of the field in the database for this dimension (defaults to the key)
Related to the Message model: if you want sender name, use sender__name.
Return True for real categorical dimensions
-
exclude
(queryset, **kwargs)[source]¶ Exclude some points from a queryset and return the new queryset.
-
group_by
(queryset, grouping_key=None, values_list=False, values_list_flat=False, **kwargs)[source]¶ Return a ValuesQuerySet that has been grouped by this dimension. The group value will be available as grouping_key in the dictionaries.
The grouping key defaults to the dimension key.
from django.db.models import Count
messages = dim.group_by(messages, 'value')
distribution = messages.annotate(count=Count('id'))
print distribution[0]  # { 'value': 'hello', 'count': 5 }
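To show the shape of the resulting distribution without a database, here is a small pure-Python sketch. It assumes messages are plain dictionaries, and `group_counts` is a hypothetical helper rather than anything provided by the project:

```python
from collections import Counter


def group_counts(messages, field, grouping_key=None):
    """Count messages per distinct value of field, keyed by grouping_key."""
    # Like group_by above, the grouping key defaults to the field name.
    grouping_key = grouping_key or field
    counts = Counter(message[field] for message in messages)
    return [{grouping_key: value, 'count': count}
            for value, count in sorted(counts.items())]


messages = [{'word': 'hello'}] * 5 + [{'word': 'bye'}] * 2
distribution = group_counts(messages, 'word', grouping_key='value')
print(distribution)
```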
-
select_grouping_expression
(queryset, expression)[source]¶ Add an expression for grouping to the queryset’s SELECT. Returns the queryset plus the alias for the expression.
For categorical dimensions this is a no-op. Beware if your expression refers to a related table!
-
class
msgvis.apps.dimensions.models.
ChoicesCategoricalDimension
(key, name=None, description=None, field_name=None, domain=None)[source]¶ A categorical dimension where the values come from a choices set.
Don’t use for related fields.
-
class
msgvis.apps.dimensions.models.
RelatedCategoricalDimension
(key, name=None, description=None, field_name=None, domain=None)[source]¶ A categorical dimension where the values are in a related table, e.g. sender name.
Currently doesn’t really do much beyond CategoricalDimension.
Return True for related categorical dimensions
-
class
msgvis.apps.dimensions.models.
QuantitativeDimension
(key, name=None, description=None, field_name=None, default_bins=50, min_bin_size=1)[source]¶ A generic quantitative dimension. This works for fields on Message or on related fields, e.g. field_name=sender__message_count
-
get_range
(queryset)[source]¶ Find a min and max for this dimension, as a tuple. If there isn’t one, (None, None) is returned.
-
get_grouping_expression
(queryset, bins=None, bin_size=None, **kwargs)[source]¶ Generate a SQL expression for grouping this dimension. If you already know the bin size or the number of bins you want, you may provide either one.
-
select_grouping_expression
(queryset, expression)[source]¶ Add an expression for grouping to the queryset’s SELECT.
Returns a queryset, grouping_key tuple. The grouping_key could be used in values to identify the grouping expression.
-
group_by
(queryset, grouping_key=None, bins=None, bin_size=None, **kwargs)[source]¶ Return a ValuesQuerySet that has been grouped by this dimension. The group value will be available as grouping_key in the dictionaries.
The grouping key defaults to the dimension key.
If neither bins nor bin_size is provided, an estimate will be used.
from django.db.models import Count
messages = dim.group_by(messages, 'value', bins=100)
distribution = messages.annotate(count=Count('id'))
print distribution[0]  # { 'value': 'hello', 'count': 5 }
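How the estimate is computed is not documented here. One plausible approach, shown purely as a sketch, divides the observed range by the requested number of bins and never goes below the minimum bin size. Both helpers are hypothetical, and the formula is an assumption rather than the project's actual rule:

```python
import math
from collections import Counter


def estimate_bin_size(min_val, max_val, bins=50, min_bin_size=1):
    """Assumed rule: range divided by bin count, rounded up, with a floor."""
    span = max_val - min_val
    return max(min_bin_size, int(math.ceil(span / float(bins))))


def bin_counts(values, bin_size):
    """Group integer values by the lower edge of the bin they fall into."""
    counts = Counter((value // bin_size) * bin_size for value in values)
    return [{'value': start, 'count': count}
            for start, count in sorted(counts.items())]


values = [0, 3, 7, 12, 18, 95, 99]
bin_size = estimate_bin_size(min(values), max(values), bins=10)
binned = bin_counts(values, bin_size)
print(bin_size)
print(binned)
```

Labelling each bin by its lower edge keeps the grouping value numeric, so it can still be sorted and plotted on a quantitative axis.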
-
-
class
msgvis.apps.dimensions.models.
RelatedQuantitativeDimension
(key, name=None, description=None, field_name=None, default_bins=50, min_bin_size=1)[source]¶ A quantitative dimension on a related model, e.g. sender message count.
-
class
msgvis.apps.dimensions.models.
TimeDimension
(key, name=None, description=None, field_name=None, default_bins=50, min_bin_size=1)[source]¶ A dimension for time fields on Message
Corpus¶
The Corpus app is concerned with the representation of raw message data and its associated metadata.
Models¶
-
class
msgvis.apps.corpus.models.
Dataset
(*args, **kwargs)[source]¶ A top-level dataset object containing messages.
-
name
= None¶ The name of the dataset
-
description
= None¶ A description of the dataset.
-
created_at
= None¶ The
datetime.datetime
when the dataset was created.
-
start_time
= None¶ The time of the first real message in the dataset
-
end_time
= None¶ The time of the last real message in the dataset
-
-
class
msgvis.apps.corpus.models.
MessageType
(*args, **kwargs)[source]¶ The type of a message, e.g. retweet, reply, original, system...
-
name
= None¶ The name of the message type
-
-
class
msgvis.apps.corpus.models.
Language
(*args, **kwargs)[source]¶ Represents the language of a message or a user
-
code
= None¶ A short language code like ‘en’
-
name
= None¶ The full name of the language
-
-
class
msgvis.apps.corpus.models.
Url
(*args, **kwargs)[source]¶ A url from a message
-
domain
= None¶ The root domain of the url
-
short_url
= None¶ A shortened url
-
full_url
= None¶ The full url
-
-
class
msgvis.apps.corpus.models.
Hashtag
(*args, **kwargs)[source]¶ A hashtag in a message
-
text
= None¶ The text of the hashtag, without the hash
-
-
class
msgvis.apps.corpus.models.
Media
(*args, **kwargs)[source]¶ Linked media, e.g. photos or videos.
-
type
= None¶ The kind of media this is.
-
media_url
= None¶ A url where the media may be accessed
-
-
class
msgvis.apps.corpus.models.
Timezone
(*args, **kwargs)[source]¶ The timezone of a message or user
-
olson_code
= None¶ The timezone code from pytz.
-
name
= None¶ Another name for the timezone, perhaps the country where it is located?
-
-
class
msgvis.apps.corpus.models.
Person
(*args, **kwargs)[source]¶ A person who sends messages in a dataset.
-
original_id
= None¶ An external id for the person, e.g. a user id from Twitter
-
username
= None¶ Username is a short system-y name.
-
full_name
= None¶ Full name is a longer user-friendly name
-
message_count
= None¶ The number of messages the person produced
-
replied_to_count
= None¶ The number of times the person’s messages were replied to
The number of times the person’s messages were shared or retweeted
-
mentioned_count
= None¶ The number of times the person was mentioned in other people’s messages
-
friend_count
= None¶ The number of people this user has connected to
-
follower_count
= None¶ The number of people who have connected to this person
-
profile_image_url
= None¶ The person’s profile image url
-
-
class
msgvis.apps.corpus.models.
Message
(*args, **kwargs)[source]¶ The Message is the central data entity for the dataset.
-
original_id
= None¶ An external id for the message, e.g. a tweet id from Twitter
-
type
¶ The
MessageType
Message type: retweet, reply, origin...
-
time
= None¶ The
datetime.datetime
(in UTC) when the message was sent
-
sentiment
= None¶ The sentiment label for the message.
-
replied_to_count
= None¶ The number of replies this message received.
The number of times this message was shared or retweeted.
The set of
Hashtag
in the message.
-
text
= None¶ The actual text of the message.
-
Datatable¶
The Datatable app is responsible for generating and returning visualization data for specific configurations of dimensions and filters.
Questions¶
The Questions app is concerned with persisting research questions and articles to the database and retrieving research questions that correspond to current dimension selections.
Models¶
-
class
msgvis.apps.questions.models.
Article
(*args, **kwargs)[source]¶ A published research article.
-
year
= None¶ The publication year for the article.
A plain-text author list.
-
link
= None¶ A url to the article.
-
title
= None¶ The title of the article.
-
venue
= None¶ The venue where the article was published.
-
-
class
msgvis.apps.questions.models.
Question
(*args, **kwargs)[source]¶ A research question from an Article. May be associated with a number of DimensionKey objects.
-
source
¶ The source article for the question.
-
text
= None¶ The text of the question.
-
dimensions
¶ A set of dimensions related to the question.
-
API¶
The purpose of the API is to provide access to statistical summaries of the message database that can be used to render visualizations. With many of the API requests, a JSON object should be provided that indicates the user’s current interest and affects how the results will be delivered.
API Objects¶
This module defines serializers for the main API data objects:
DimensionSerializer | JSON representation of Dimensions for the API. |
FilterSerializer | Filters indicate a subset of the range of a specific dimension. |
MessageSerializer | JSON representation of Message objects for the API. |
QuestionSerializer | JSON representation of a Question object for the API. |
-
class
msgvis.apps.api.serializers.
DimensionSerializer
(instance=None, data=<class rest_framework.fields.empty>, **kwargs)[source]¶ JSON representation of Dimensions for the API.
Dimension objects describe the variables that users can select to visualize the dataset. An example is below:
{
  "key": "time",
  "name": "Time",
  "description": "The time the message was sent"
}
-
class
msgvis.apps.api.serializers.
FilterSerializer
(instance=None, data=<class rest_framework.fields.empty>, **kwargs)[source]¶ Filters indicate a subset of the range of a specific dimension. Below is an array of three filter objects.
[{
  "dimension": "time",
  "min_time": "2010-02-25T00:23:53Z",
  "max_time": "2010-02-28T00:23:53Z"
}, {
  "dimension": "words",
  "levels": [
    "cat",
    "dog",
    "alligator"
  ]
}, {
  "dimension": "reply_count",
  "max": 100
}]
Although every filter has a dimension field, the specific properties vary depending on the type of the dimension and the kind of filter.
At this time, there are three types of filters:
- Quantitative dimensions can be filtered using one or both of the min and max properties (inclusive).
- The time dimension can be filtered using one or both of the min_time and max_time properties (inclusive).
- Categorical dimensions can be filtered by specifying an include list. All other items are assumed to be excluded.
The ‘value’ field may also be used for exact matches.
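The three filter types can be illustrated with a small in-memory sketch. It is not the server-side implementation: messages are assumed to be plain dictionaries keyed by dimension name, the helper names are hypothetical, and the categorical branch accepts either the `include` key named in the list or the `levels` key used in the example array.

```python
from datetime import datetime

TIME_FORMAT = '%Y-%m-%dT%H:%M:%SZ'


def parse_time(text):
    """Parse timestamps in the format used by the filter examples."""
    return datetime.strptime(text, TIME_FORMAT)


def matches(message, filt):
    """Return True when the message satisfies one filter object."""
    value = message[filt['dimension']]
    # Quantitative bounds, both inclusive.
    if 'min' in filt and value < filt['min']:
        return False
    if 'max' in filt and value > filt['max']:
        return False
    # Time bounds, both inclusive.
    if 'min_time' in filt and parse_time(value) < parse_time(filt['min_time']):
        return False
    if 'max_time' in filt and parse_time(value) > parse_time(filt['max_time']):
        return False
    # Categorical inclusion: anything not listed is excluded.
    allowed = filt.get('include', filt.get('levels'))
    if allowed is not None and value not in allowed:
        return False
    # Exact match on a single value.
    if 'value' in filt and value != filt['value']:
        return False
    return True


def apply_filters(messages, filters):
    """Keep the messages that satisfy every filter."""
    return [m for m in messages if all(matches(m, f) for f in filters)]


messages = [
    {'time': '2010-02-26T12:00:00Z', 'reply_count': 3, 'language': 'en'},
    {'time': '2010-03-01T00:00:00Z', 'reply_count': 3, 'language': 'en'},
    {'time': '2010-02-27T08:30:00Z', 'reply_count': 250, 'language': 'fr'},
]
filters = [
    {'dimension': 'time', 'min_time': '2010-02-25T00:23:53Z',
     'max_time': '2010-02-28T00:23:53Z'},
    {'dimension': 'reply_count', 'max': 100},
    {'dimension': 'language', 'include': ['en', 'es']},
]
kept = apply_filters(messages, filters)
print(len(kept))
```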
-
class
msgvis.apps.api.serializers.
MessageSerializer
(instance=None, data=<class rest_framework.fields.empty>, **kwargs)[source]¶ JSON representation of Message objects for the API.
Messages are provided in a simple format that is useful for displaying examples:
{
  "id": 52,
  "dataset": 2,
  "text": "Some sort of thing or other",
  "sender": {
    "id": 2,
    "dataset": 1,
    "original_id": 2568434,
    "username": "my_name",
    "full_name": "My Name"
  },
  "time": "2010-02-25T00:23:53Z"
}
Additional fields may be added later.
-
class
msgvis.apps.api.serializers.
QuestionSerializer
(instance=None, data=<class rest_framework.fields.empty>, **kwargs)[source]¶ JSON representation of a Question object for the API.
Research questions extracted from papers are given in the following format:
{
  "id": 5,
  "text": "What is your name?",
  "source": {
    "id": 13,
    "authors": "Thingummy & Bob",
    "link": "http://ijn.com/3453295",
    "title": "Names and such",
    "year": "2001",
    "venue": "International Journal of Names"
  },
  "dimensions": ["time", "author_name"]
}
The source object describes a research article reference where the question originated.
The dimensions list indicates which dimensions the research question is associated with.
API Endpoints¶
The view classes below define the API endpoints.
Endpoint | Url | Purpose |
---|---|---|
Get Data Table | /api/table | Get table of counts based on dimensions/filters |
Get Example Messages | /api/messages | Get example messages for slice of data |
Get Research Questions | /api/questions | Get RQs related to dimensions/filters |
Message Context | /api/context | Get context for a message |
Snapshots | /api/snapshots | Save a visualization snapshot |
-
class
msgvis.apps.api.views.
DataTableView
(**kwargs)[source]¶ Get a table of message counts or other statistics based on the current dimensions and filters.
The request should post a JSON object containing a list of one or two dimension ids and a list of filters. A measure may also be specified in the request, but the default measure is message count.
The response will be a JSON object that mimics the request body, but with a new result field added. The result field includes a table, which will be a list of objects.
Each object in the table field represents a cell in a table or a dot (for scatterplot-type results). For every dimension in the dimensions list (from the request), the result object will include a property keyed to the name of the dimension and a value for that dimension. A value field provides the requested summary statistic.
The result field also includes a domains object, which defines the list of possible values within the selected data for each of the dimensions in the request.
This is the most general output format for results, but later we may switch to a more compact format.
Request:
POST /api/table
Format: (request without result key)
{
  "dataset": 1,
  "dimensions": ["time"],
  "filters": [
    {
      "dimension": "time",
      "min_time": "2015-02-25T00:23:53Z",
      "max_time": "2015-02-28T00:23:53Z"
    }
  ],
  "result": {
    "table": [
      { "value": 35, "time": "2015-02-25T00:23:53Z" },
      { "value": 35, "time": "2015-02-26T00:23:53Z" },
      { "value": 35, "time": "2015-02-27T00:23:53Z" },
      { "value": 35, "time": "2015-02-28T00:23:53Z" }
    ],
    "domains": {
      "time": [
        "some_time_val",
        "some_time_val",
        "some_time_val",
        "some_time_val"
      ]
    },
    "domain_labels": {}
  }
}
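As a sketch of how a client might assemble the request body and read the table back, assuming only the field names shown in the format above (the helper names `build_table_request` and `rows_from_response` are hypothetical, and no HTTP call is made here):

```python
import json


def build_table_request(dataset, dimensions, filters=None):
    """Build the POST body for /api/table, without the result key."""
    if not 1 <= len(dimensions) <= 2:
        raise ValueError('expected one or two dimensions')
    return {
        'dataset': dataset,
        'dimensions': list(dimensions),
        'filters': list(filters or []),
    }


def rows_from_response(response, dimensions):
    """Flatten result.table into (dimension values..., value) tuples."""
    return [tuple(cell[d] for d in dimensions) + (cell['value'],)
            for cell in response['result']['table']]


request = build_table_request(1, ['time'], [{
    'dimension': 'time',
    'min_time': '2015-02-25T00:23:53Z',
    'max_time': '2015-02-28T00:23:53Z',
}])
body = json.dumps(request)  # this string would be sent as the POST body

# A hand-written response in the documented shape, for illustration only.
response = dict(request, result={
    'table': [{'value': 35, 'time': '2015-02-25T00:23:53Z'}],
    'domains': {'time': ['2015-02-25T00:23:53Z']},
})
rows = rows_from_response(response, request['dimensions'])
print(rows)
```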
-
class
msgvis.apps.api.views.
ExampleMessagesView
(**kwargs)[source]¶ Get some example messages matching the current filters and a focus within the visualization.
Request:
POST /api/messages
Format: (request should not have messages key)
{
  "dataset": 1,
  "filters": [
    {
      "dimension": "time",
      "min_time": "2015-02-25T00:23:53Z",
      "max_time": "2015-02-28T00:23:53Z"
    }
  ],
  "focus": [
    {
      "dimension": "time",
      "value": "2015-02-28T00:23:53Z"
    }
  ],
  "messages": [
    {
      "id": 52,
      "dataset": 1,
      "text": "Some sort of thing or other",
      "sender": {
        "id": 2,
        "dataset": 1,
        "original_id": 2568434,
        "username": "my_name",
        "full_name": "My Name"
      },
      "time": "2015-02-25T00:23:53Z"
    }
  ]
}
-
class
msgvis.apps.api.views.
KeywordMessagesView
(**kwargs)[source]¶ Get some example messages matching the keyword.
Request:
POST /api/search
Format: (request should not have messages key)
{
  "dataset": 1,
  "keywords": "soup ladies,food,NOT job",
  "messages": [
    {
      "id": 52,
      "dataset": 1,
      "text": "Some sort of thing or other",
      "sender": {
        "id": 2,
        "dataset": 1,
        "original_id": 2568434,
        "username": "my_name",
        "full_name": "My Name"
      },
      "time": "2015-02-25T00:23:53Z"
    }
  ]
}
-
class
msgvis.apps.api.views.
ActionHistoryView
(**kwargs)[source]¶ Add an action history record.
Request:
POST /api/history
Format: (request should not have messages key)
{
  "records": [
    {
      "type": "click-legend",
      "contents": "group 10"
    },
    {
      "type": "group:delete",
      "contents": "{\"group\": 10}"
    }
  ]
}
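In the second record the contents field is itself a JSON-encoded string. A minimal sketch of building such a payload, assuming structured contents are serialized with `json.dumps` before sending (the `make_record` helper is hypothetical, and the code is written for Python 3):

```python
import json


def make_record(record_type, contents):
    """Build one history record; non-string contents are JSON-encoded."""
    if not isinstance(contents, str):
        contents = json.dumps(contents)
    return {'type': record_type, 'contents': contents}


payload = {
    'records': [
        make_record('click-legend', 'group 10'),
        make_record('group:delete', {'group': 10}),
    ]
}
print(json.dumps(payload))
```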
-
class
msgvis.apps.api.views.
GroupView
(**kwargs)[source]¶ Get some example messages matching the keyword.
Request:
POST /api/group
Format: (request should not have messages key)
{
  "dataset": 1,
  "keyword": "like",
  "messages": [
    {
      "id": 52,
      "dataset": 1,
      "text": "Some sort of thing or other",
      "sender": {
        "id": 2,
        "dataset": 1,
        "original_id": 2568434,
        "username": "my_name",
        "full_name": "My Name"
      },
      "time": "2015-02-25T00:23:53Z"
    }
  ]
}
-
class
msgvis.apps.api.views.
KeywordView
(**kwargs)[source]¶ Get top 10 keyword results.
Request:
GET /api/keyword?dataset=1&q= [...]
{
  "dataset": 1,
  "q": "mudslide oso",
  "keywords": ["mudslide oso", "mudslide oso soup", "mudslide oso ladies"]
}
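Because this endpoint takes its parameters in the query string, the query text needs URL encoding. A short sketch for Python 3 (on Python 2.7 `urlencode` lives in the `urllib` module instead); the `keyword_url` helper is hypothetical and returns only the path and query, with no host assumed:

```python
from urllib.parse import urlencode


def keyword_url(dataset, query, base='/api/keyword'):
    """Build the GET path and query string for a keyword lookup."""
    return base + '?' + urlencode({'dataset': dataset, 'q': query})


url = keyword_url(1, 'mudslide oso')
print(url)
```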
-
class
msgvis.apps.api.views.
ResearchQuestionsView
(**kwargs)[source]¶ Get a list of research questions related to a selection of dimensions and filters.
Request:
POST /api/questions
Format: (request without questions key)
{
  "dimensions": ["time", "hashtags"],
  "questions": [
    {
      "id": 5,
      "text": "What is your name?",
      "source": {
        "id": 13,
        "authors": "Thingummy & Bob",
        "link": "http://ijn.com/3453295",
        "title": "Names and such",
        "year": "2001",
        "venue": "International Journal of Names"
      },
      "dimensions": ["time", "author_name"]
    }
  ]
}
Development Setup¶
To run this project, you can either set up your own machine or use a virtual Ubuntu machine with Vagrant. There are separate instructions for each below:
Run in a VM¶
There is configuration included to run this project inside an Ubuntu virtual machine controlled by Vagrant. This is especially recommended on Windows. If you go this route, you can skip the Manual Setup section below.
Instead, follow these steps:
- Install Vagrant and VirtualBox.
- Start the virtual machine:
$ vagrant up
This will download a basic Ubuntu image, install some additional software on it, and perform the initial project setup.
Note
If you are on Windows: run this command in an Administrator cmd.exe or PowerShell.
If you are on Mac: make sure the setup script has executable permission:
chmod a+x setup/scripts/dev_setup.sh
- Once your Ubuntu VM is started, you can SSH into it with vagrant ssh. This will use key-based authentication to log you into the VM.
You can also log in using any SSH client (e.g. PuTTY), at localhost:2222. The username and password are both vagrant, or you can configure key-based auth: use vagrant ssh-config to find the private key for accessing the VM.
When you log in, your terminal will automatically drop into a Python virtualenv and cd to /home/vagrant/textvisdrg.
Manual Setup¶
You will need to have the following packages installed:
- MySQL 5.5
- Python 2.7 and pip
- virtualenv
- virtualenvwrapper (recommended)
- Node.js
- Bower
Once you have the above prerequisites working, clone this repository to your machine.
Go to the directory where you have cloned the repository and run the setup script, as below:
$ cd textvisdrg
$ ./setup/scripts/dev_setup.sh
This script will perform the following steps for you:
- Check that your system has the prerequisites available.
- Prompt you for database settings. If it can’t reach the database, it will give you a snippet of MySQL code needed to create the database with the supplied settings.
- Create a Python virtual environment. This keeps Python packages needed for this project from interfering with any other packages you already have installed on your system.
- Create a .env file in your project directory that sets environment variables for Django, most importantly the database connection settings.
- Install Python packages, NPM packages, and Bower packages (using the fab dependencies command).
- Run the database migrations (using fab migrate).
Workflow¶
This page explains how to develop this software and the various
processes involved. For now, refer to the
fabfile
for useful shortcut commands.
Fabric Commands¶
Define common admin and maintenance tasks here. For more info: http://docs.fabfile.org/en/latest/
-
fabfile.
pip_install
(environment='dev')¶ Install pip requirements for an environment: test, prod, [dev]
-
fabfile.
dependencies
(default_env='dev')[source]¶ Install requirements for pip, npm, and bower all at once.
-
fabfile.
test
(settings_module='msgvis.settings.test')¶ Run tests
-
fabfile.
test_coverage
(settings_module='msgvis.settings.test')¶ Run tests with coverage
-
fabfile.
make_test_data
(outfile=path(u'/home/docs/checkouts/readthedocs.org/user_builds/textvisdrg/checkouts/latest/setup/fixtures/test_data.json'))¶ Updates the test_data.json file based on what is in the database
-
fabfile.
load_test_data
(infile=path(u'/home/docs/checkouts/readthedocs.org/user_builds/textvisdrg/checkouts/latest/setup/fixtures/test_data.json'))¶ Load test data from test_data.json
-
fabfile.
deploy
(branch=None)[source]¶ SSH into a remote server, run commands to update deployment, and start the server.
This requires that the server is already running a fairly recent copy of the code.
Furthermore, the app must use a
-
fabfile.
topic_pipeline
(dataset, name='my topic model', num_topics=30)[source]¶ Run the topic pipeline on a dataset
-
fabfile.
nltk_init
()¶ Download required nltk corpora