1. Introduction
Blogs, or Weblogs, have become increasingly popular in recent years. A blog is a Web-based publication that allows users to add content periodically, normally in reverse chronological order, in a relatively easy way. As a result, many communities have emerged in the blogosphere. These could be support communities such as those for technical support or educational support. In addition, there are also hate groups in blogs that are formed by bloggers who are racists or extremists. The consequences of the formation of such groups on the Internet should not be underestimated. Because young people are the major group of bloggers, they are more likely to be affected and even "brainwashed" by ideas propagated through the Web as a global medium.
Facing this new trend in cyberspace, our study has two objectives. First, we propose a semi-automated approach that combines blog spidering and social network analysis techniques to facilitate the monitoring, study, and research on the networks of bloggers, especially those in hate groups. Second, our study seeks insights into the organization and movement of online hate groups.
2. Web mining and social network analysis
Techniques based on both Web mining and social network analysis have been used in intelligence- and security-related applications and have achieved considerable success. Web mining techniques can be categorized into three types: content mining, structure mining, and usage mining (Kosala and Blockeel, 2000).
- Web content mining refers to the discovery of useful information from Web contents, including text, images, audio, video, etc.
- Web structure mining studies the model underlying the hyperlink structures of the Web. It usually involves the analysis of the in-link and out-link information of a Web page, and has been used for search engine result ranking and other Web applications.
- Web usage mining employs data mining techniques to analyze search logs or other activity logs to find interesting patterns.
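As an illustration of the in-link and out-link analysis used in Web structure mining, the following minimal sketch counts both for each page in a small link graph. The function name `link_counts` and the toy data are hypothetical and are not taken from the original study.

```python
from collections import defaultdict

def link_counts(out_links):
    """Return {page: (in_link_count, out_link_count)} for a link graph.

    `out_links` maps each page to the list of pages it links to.
    Duplicate links and self-links are ignored (an assumption of this sketch).
    """
    in_degree = defaultdict(int)
    out_degree = {}
    for page, targets in out_links.items():
        unique_targets = set(targets) - {page}
        out_degree[page] = len(unique_targets)
        for target in unique_targets:
            in_degree[target] += 1
    # Include pages that are only ever linked to, never listed as a source.
    pages = set(out_links) | set(in_degree)
    return {p: (in_degree[p], out_degree.get(p, 0)) for p in pages}

# Toy link graph: page "a" links to "b" and "c", and "b" links to "c".
toy_links = {"a": ["b", "c"], "b": ["c"], "c": []}
counts = link_counts(toy_links)
```

A page with many in-links, such as "c" here, is the kind of page that link-based ranking methods tend to favor.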
3. Proposed approach
We propose a semi-automated approach for identifying groups and analyzing their relationships in blogs. The approach is diagrammed in Fig. 1. Our approach consists of four main modules: (a) Blog Spider, (b) Information Extraction, (c) Network Analysis, and (d) Visualization. The Blog Spider module downloads blog pages from the Web. These pages are then processed by the Information Extraction module. Data about these blogs and their relationships are extracted and passed to the Network Analysis module for further analysis. Finally, the Visualization module presents the analysis results to users in a graphical display. In the following, we describe each module in more detail.
3.1. Blog spider
A blog spider program is first needed to download the relevant pages from the blogs of interest. This process is similar to general Web fetching. To fetch pages in parallel, asynchronous I/O can be used (Brin and Page, 1998). After a page is downloaded, it can be stored in a relational database or as a flat file. In addition, the spider can use RSS (Really Simple Syndication) feeds to be notified when a blog is updated.
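The parallel fetching and storage steps can be sketched as follows. This is only an illustrative outline under stated assumptions: the fetch function is injected and replaced here by a stand-in (`fake_fetch`) so the example runs without network access, and the table layout and all names are hypothetical rather than those of the original spider.

```python
import asyncio
import sqlite3

async def fetch_all(urls, fetch_page, max_concurrent=5):
    """Fetch all URLs concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url):
        async with semaphore:
            return url, await fetch_page(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(u) for u in urls))

def store_pages(conn, pages):
    """Store (url, html) pairs in a relational table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", pages)
    conn.commit()

async def fake_fetch(url):
    """Stand-in for a real HTTP request; returns placeholder HTML."""
    await asyncio.sleep(0)
    return f"<html>content of {url}</html>"

urls = ["http://blog.example/a", "http://blog.example/b"]
pages = asyncio.run(fetch_all(urls, fake_fetch))

conn = sqlite3.connect(":memory:")
store_pages(conn, pages)
```

In a real spider, `fake_fetch` would be replaced by an actual asynchronous HTTP client call, and the in-memory database by a persistent one.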
3.2. Information extraction
After a blog page has been downloaded, it is necessary to extract useful information from the page. This includes information related to the blog or the blogger, such as user profiles and date of creation. It can also include information on the relationships between two bloggers, such as hyperlinks, comments, or subscriptions.
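As a rough illustration of extracting relationship information from a downloaded page, the sketch below collects the links on a page and keeps those that point to other bloggers' profiles. The page markup, the `/users/` URL prefix, and the names `LinkExtractor` and `extract_subscriptions` are assumptions made for this example; real blog hosting sites would need site-specific rules.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def extract_subscriptions(html, profile_prefix):
    """Return the blogger names linked under `profile_prefix` (hypothetical URL scheme)."""
    parser = LinkExtractor()
    parser.feed(html)
    return [
        link[len(profile_prefix):]
        for link in parser.links
        if link.startswith(profile_prefix)
    ]

# Hypothetical fragment of a blogger's subscription list.
sample_page = """
<div class="subscriptions">
  <a href="/users/alice">alice</a>
  <a href="/users/bob">bob</a>
  <a href="/help">Help</a>
</div>
"""
subscribed_to = extract_subscriptions(sample_page, "/users/")
```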
3.3. Network analysis
Network analysis is a major component in our approach. In this module we propose three types of analysis: topological analysis, centrality analysis, and community analysis.
- The goal of topological analysis is to ensure that the network extracted based on links between bloggers is not random and that it is meaningful to perform the centrality and community analyses. We use three statistics that are widely used in topological studies to categorize the extracted network (Albert and Barabási, 2002): average shortest path length, clustering coefficient, and degree distribution.
- The goal of centrality analysis is to identify the key nodes in a network. Three traditional centrality measures can be used: degree, betweenness, and closeness.
- The goal of community analysis is to identify social groups in a network. In social network analysis (SNA), a subset of nodes is considered a community or a social group if nodes in this group have stronger or denser links with nodes within the group than with nodes outside of the group (Wasserman and Faust, 1994).
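The topological statistics and two of the centrality measures named above can be sketched for a small undirected network as follows. This is a minimal illustration, not the implementation used in the study: the graph representation (a dictionary of neighbor sets), the function names, and the toy network are assumptions, closeness is computed assuming a connected graph, and betweenness and community detection are omitted for brevity.

```python
from collections import Counter, deque

def bfs_distances(graph, source):
    """Shortest-path lengths (in hops) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def average_shortest_path_length(graph):
    """Mean distance over all ordered pairs of distinct, reachable nodes."""
    total, pairs = 0, 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

def clustering_coefficient(graph):
    """Average, over nodes, of the fraction of neighbor pairs that are linked."""
    coeffs = []
    for neighbors in graph.values():
        nbs = list(neighbors)
        k = len(nbs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(
            1 for i in range(k) for j in range(i + 1, k) if nbs[j] in graph[nbs[i]]
        )
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

def degree_distribution(graph):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(neighbors) for neighbors in graph.values())

def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(nbs) / (n - 1) for node, nbs in graph.items()}

def closeness_centrality(graph):
    """(n - 1) divided by the sum of distances to all other nodes."""
    n = len(graph)
    result = {}
    for node in graph:
        total = sum(bfs_distances(graph, node).values())
        result[node] = (n - 1) / total if total else 0.0
    return result

# Toy network: a triangle a-b-c with an extra node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
avg_path = average_shortest_path_length(toy)
clustering = clustering_coefficient(toy)
degrees = degree_distribution(toy)
closeness = closeness_centrality(toy)
```

In this toy network, node "c" has the highest degree and closeness, which is the sense in which centrality analysis identifies key nodes.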
3.4. Visualization
The extracted network and analysis results can be visualized using various types of network layout methods.
4. Case study
4.1. Focus and methods
We applied our approach to conduct a case study of hate groups in blogs. We chose to study the hate groups against Blacks. There are two reasons for this focus. First, the nature of hate groups and hate crimes is often dependent on the target "hated" group. By focusing on one type of hate group, it is possible to identify relationships that are more prominent. Second, among different hate crimes, anti-Black hate crimes have been one of the most widely studied (e.g., Burris et al., 2000; Glaser et al., 2002). We applied the four modules of our approach as follows:
- Spiders were used to automatically download the description page and member list of each of these groups. A total of 820 bloggers were identified from these 28 groups. The spiders further downloaded the blogs of each of these bloggers.
- The extraction program also analyzed the relationships between these bloggers. In this study, two types of relationships were extracted:
(1) Group co-membership: two bloggers belong to the same group (blogring).
(2) Subscription: blogger A subscribes to blogger B. This is a directed, binary relationship.
- After collecting the blogs and extracting information from them, we performed demographic and network analysis on the data set in order to reveal the characteristics of these groups and ascertain whether any patterns exist.
- Visualization was then applied to present the results. We discuss the details of our analysis in the following sections.
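The two relationship types extracted in this step, undirected group co-membership and directed subscription, can be represented as edge sets. The sketch below is illustrative only; the function names and the sample blogrings and bloggers are hypothetical and do not come from the data set.

```python
from itertools import combinations

def co_membership_edges(groups):
    """Undirected edges: one (a, b) pair, with a < b, per two bloggers sharing a group."""
    edges = set()
    for members in groups.values():
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, b))
    return edges

def subscription_edges(subscriptions):
    """Directed edges: (subscriber, subscribed_to). Binary, so duplicates collapse."""
    return {
        (source, target)
        for source, targets in subscriptions.items()
        for target in targets
    }

# Hypothetical blogrings and subscription lists.
groups = {"ring1": ["alice", "bob", "carol"], "ring2": ["bob", "dave"]}
subscriptions = {"alice": ["bob"], "bob": ["alice", "dave"]}

co_edges = co_membership_edges(groups)
sub_edges = subscription_edges(subscriptions)

# Because subscription is directed, (alice, bob) and (bob, alice) are distinct edges.
mutual = {(a, b) for (a, b) in sub_edges if (b, a) in sub_edges and a < b}
```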
4.2. Discussion
a. What are the structural properties of the social networks of bloggers in the hate groups?
Ans: Similar to the network of white supremacist Web sites (Burris et al., 2000), the network of bloggers in hate groups is decentralized.
b. Are there bloggers who stand out as leaders of influence in these groups?
Ans: Burris et al. (2000) found that the decentralized white supremacist groups had different centers of influence.
c. What is the community structure in these groups?
Ans: Communities exist in the network. However, these communities are composed not of Web sites but of individual bloggers. Communities provide an environment for their members to exchange ideas and opinions and to reinforce the shared ideology.
d. What do the structural properties suggest about the organization of the hate groups?
Ans: As mentioned in point (a), the structure of the network suggests that the hate groups in the blogosphere have not formed into centralized organizations.
e. What are the social and political implications of these properties?
Ans: Burris et al. (2000) commented that extremist groups are a type of social movement which has profound social and political implications.
5. Conclusion and future directions
In this paper, we have discussed the problems of the emergence of hate groups and racism in blogs. Our contributions are twofold. First, we have proposed a semi-automated approach for blog analysis. Our approach consists of a set of Web mining and network analysis techniques, such as network topology analysis, that can be applied to the study of the blogosphere. We believe that the approach can also be applied to other domains that involve virtual community analysis and mining, which we expect to be an increasingly important field for various applications.
Second, we applied this approach to investigate the characteristics and structural relationships of the hate groups in blogs in our case study. Our study has not only provided an approach that could help law enforcement and social workers study and monitor such activities, but has also brought insights into the structural properties of online hate groups and helped broaden and deepen our understanding of such a social movement.