What is Twitter, a Social Network or a News Media?
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon
Proceedings of the 19th International World Wide Web (WWW) Conference, April 26-30, 2010, Raleigh NC (USA)
We have crawled the entire Twitter site and obtained 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets. In its follower-following topology analysis we have found a non-power-law follower distribution, a short effective diameter, and low reciprocity, which all mark a deviation from known characteristics of human social networks~\cite{Newman03}. In order to identify influentials on Twitter, we have ranked users by the number of followers and by PageRank and found the two rankings to be similar. Ranking by retweets differs from the previous two rankings, indicating a gap between influence inferred from the number of followers and influence inferred from the popularity of one's tweets. We have analyzed the tweets of top trending topics and reported on their temporal behavior and user participation. We have classified the trending topics based on the active period and the tweets and show that the majority (over 85%) of topics are headline news or persistent news in nature. A closer look at retweets reveals that any retweeted tweet reaches an average of 1,000 users regardless of how many followers the author of the original tweet has. Once retweeted, a tweet gets retweeted almost instantly on next hops, signifying fast diffusion of information after the first retweet.
To the best of our knowledge this work is the first quantitative study on the entire Twittersphere and information diffusion on it.
[PDF (4.8MB)]
@inproceedings{Kwak10www,
author = {Kwak, Haewoon and Lee, Changhyun and Park, Hosung and Moon, Sue},
title = "{W}hat is {T}witter, a social network or a news media?",
booktitle = {WWW '10: Proceedings of the 19th international conference on World wide web},
year = {2010},
isbn = {978-1-60558-799-8},
pages = {591--600},
location = {Raleigh, North Carolina, USA},
doi = {https://doi.acm.org/10.1145/1772690.1772751},
publisher = {ACM},
address = {New York, NY, USA},
}
Slides
Data
(for more info, read RWW's article "How Recent Changes to Twitter's Terms of Service Might Hurt Academic Research")
Social graph
- Download
* Now we offer direct download links: [GitHub]
twitter_rv.tar.gz.torrent (34KB) or twitter_rv.zip.torrent (26KB) (# of seeds >= 4)
twitter_rv.tar.gz, 6,475,352,982 bytes, MD5: c31b4c2d6f3ae325e516e78b499c46f8
twitter_rv.zip, 4,859,337,443 bytes, MD5: 5f2399aac71c604ac5a100fb6ca7e297
----
twitter_rv.net, 26,172,280,241 bytes, MD5: 9c0f7983a523edd1b753af68c5acc4bd
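Because the files are large, it may be worth checking a finished download against the MD5 values listed above. The following is a minimal sketch, not part of the original release: the function names `md5_of_file` and `verify` are our own, and the local file path is whatever you chose when downloading.

```python
import hashlib

# Checksums copied from the list above.
EXPECTED_MD5 = {
    "twitter_rv.tar.gz": "c31b4c2d6f3ae325e516e78b499c46f8",
    "twitter_rv.zip": "5f2399aac71c604ac5a100fb6ca7e297",
    "twitter_rv.net": "9c0f7983a523edd1b753af68c5acc4bd",
}

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, name):
    """Return True if the file at `path` matches the MD5 listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("twitter_rv.zip", "twitter_rv.zip")` returns True only if the local copy matches the listed checksum.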
- Format
USER \t FOLLOWER \n
* USER and FOLLOWER are represented by numeric IDs (integers).
* These numeric IDs are the same as the numeric IDs used by Twitter.
* Therefore, you can access the profile of user 12 via http://api.twitter.com/1/users/show.xml?user_id=12.
* For details, see Twitter API Page
- Example
12 13
12 14
12 15
16 17
* Users 13, 14 and 15 are followers of user 12.
* User 17 is a follower of user 16.
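The format above can be read one line at a time. The sketch below is only an illustration under our own naming (`iter_edges` and `follower_counts` are not part of the release); it counts followers per user on the four example lines. It splits on any whitespace, so it accepts both tabs and the spaces shown in the example.

```python
from collections import Counter
import io

def iter_edges(lines):
    """Yield (user, follower) integer pairs from 'USER<TAB>FOLLOWER' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, follower = line.split()
        yield int(user), int(follower)

def follower_counts(lines):
    """Count how many follower rows each USER has."""
    counts = Counter()
    for user, _follower in iter_edges(lines):
        counts[user] += 1
    return counts

# The four example rows from above, tab-separated.
sample = io.StringIO("12\t13\n12\t14\n12\t15\n16\t17\n")
counts = follower_counts(sample)
print(counts[12], counts[16])  # prints: 3 1
```

For the real data, pass an open file object (for example `open("twitter_rv.net")`) instead of `sample`. The file is streamed, but the per-user counter for the full graph will still need a substantial amount of memory.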
Mapping table from numeric ID to screen name
- Download
numeric2screen.tar.gz
- Format
Numeric \t Screen_name \n
* You can use this data to map from numeric ID to screen name.
* Authors of the tweets released by Yang and Leskovec (Helpful other websites 1.) are recorded by screen name.
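Loading the mapping table into a dictionary could look roughly like the sketch below. The function name `load_id_to_screen_name` and the two sample screen names are made up for illustration; the name of the file inside the archive is not stated here, so the sketch works on any iterable of lines.

```python
import io

def load_id_to_screen_name(lines):
    """Build a dict from numeric ID (int) to screen name (str)
    from 'Numeric<TAB>Screen_name' lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        numeric_id, screen_name = line.split("\t", 1)
        mapping[int(numeric_id)] = screen_name.strip()
    return mapping

# Hypothetical rows; the screen names are invented for this example.
sample = io.StringIO("12\talice_example\n13\tbob_example\n")
id_to_name = load_id_to_screen_name(sample)
print(id_to_name[12])
```

Inverting the dictionary (`{name: uid for uid, name in id_to_name.items()}`) gives the screen-name-to-ID direction, which is the one needed to join tweets recorded by screen name with the social graph.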
Restricted user profiles (> 10,000 followers)
- Download
celebrities_profiles.txt (3.0MB) (Save Link as...)
- Format
protected \t location \t profile_background_color \t utc_offset \t statuses_count \t
description \t friends_count \t profile_link_color \t profile_image_url \t notifications \t
profile_background_image_url \t screen_name \t profile_background_tile \t favourites_count \t name \t
url \t created_at \t time_zone \t profile_sidebar_border_color \t following \t
gender (inferred by name) \n
* For a description of each field, see the Return Values page in the Twitter API Wiki.
* The last field, gender, is inferred from the name. It can be m, f, or ?.
* "For U.S. births in 2008, the top 1000 names represent about 74 percent of all names." For details, see Popular Baby Names on Social Security Online.
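One way to read a profile line is to pair the tab-separated values with the field names in the order listed above. This is a sketch under assumptions: `PROFILE_FIELDS` and `parse_profile` are our own names, all values are kept as strings, and a line whose free-text fields (such as description) contain a tab would be rejected rather than guessed at.

```python
# Field names in the order given in the format description above.
PROFILE_FIELDS = [
    "protected", "location", "profile_background_color", "utc_offset",
    "statuses_count", "description", "friends_count", "profile_link_color",
    "profile_image_url", "notifications", "profile_background_image_url",
    "screen_name", "profile_background_tile", "favourites_count", "name",
    "url", "created_at", "time_zone", "profile_sidebar_border_color",
    "following", "gender",
]

def parse_profile(line):
    """Split one tab-separated profile line into a dict keyed by field name.
    Raises ValueError if the number of values does not match the field list."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(PROFILE_FIELDS):
        raise ValueError(
            f"expected {len(PROFILE_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(PROFILE_FIELDS, values))

# Synthetic line for illustration only: v0, v1, ..., with gender set to "?".
fake_line = "\t".join([f"v{i}" for i in range(20)] + ["?"]) + "\n"
profile = parse_profile(fake_line)
print(profile["screen_name"], profile["gender"])
```

Numeric fields such as statuses_count can be converted with `int(...)` after parsing, once you have checked how missing values appear in the file.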
Frequently Asked Questions
About torrent
- I cannot download the torrent. I cannot find any seed to distribute the social graph.
We keep the number of seeds at 4 or more.
If you cannot find any seed, the cause may be a network configuration issue, such as a firewall at your university.
If the problem persists, please email me (haewoon_AT_an.kaist.ac.kr).
We can provide a download link over HTTP for you.
About crawling
- How can I crawl a social graph of Twitter?
Twitter offers a rich Application Programming Interface (API).
Using the two social graph methods (friends/ids, followers/ids), you can access the entire social graph without authentication.
- But... I can send only 150 requests per hour.
While Twitter limits the API request rate to 150 requests per hour by default,
you can send up to 20,000 requests per hour (per IP) once you are registered on the whitelist.
For details, see the Whitelisting section in this page.
- Can I get more information about users in the social graph?
For every user, you can access the public user profile, such as name and bio, by calling the user method (users/show) with the numeric user ID from the social graph.
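To get a rough feel for what the two rate limits mean in practice, the arithmetic can be sketched as follows. The assumptions are ours: a single IP, exactly one request per user (a lower bound, since users with very large follower lists would likely need more than one request), and the 41.7 million users from our crawl. The helper `crawl_hours` is hypothetical.

```python
def crawl_hours(num_users, requests_per_hour, requests_per_user=1):
    """Hours needed to issue the given number of requests per user
    at a fixed hourly request limit."""
    return num_users * requests_per_user / requests_per_hour

NUM_USERS = 41_700_000  # user profiles obtained in our crawl

for label, rate in [("default limit", 150), ("whitelisted", 20_000)]:
    hours = crawl_hours(NUM_USERS, rate)
    print(f"{label}: {hours:,.0f} hours (about {hours / 24:,.0f} days)")
```

Under these assumptions the default limit is impractical for a full crawl from one machine, which is why whitelisting matters.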