Top 10 Python Libraries You Must Know in 2019

In this article, we will discuss some of the top Python libraries that developers can use to parse, clean, and represent data, and to implement machine learning in their existing applications.

We will be considering the following 10 libraries:

  • TensorFlow
  • Scikit-Learn
  • Numpy
  • Keras
  • PyTorch
  • LightGBM
  • Eli5
  • SciPy
  • Theano
  • Pandas


Introduction

Python is one of the most popular and widely used programming languages and has replaced many programming languages in the industry.

There are many reasons why Python is popular among developers. However, one of the most significant is its large collection of libraries that users can work with.

The simplicity of Python has attracted many developers to create new libraries for machine learning. Because of the huge collection of libraries, Python is becoming hugely popular among machine learning experts.

So, the first library is TensorFlow.

TensorFlow


What Is TensorFlow?

If you are currently working on a machine learning project in Python, then you may have heard about this popular open-source library known as TensorFlow.

This library was developed by Google in collaboration with the Brain Team. TensorFlow is used in almost every Google application for machine learning.

TensorFlow works like a computational library for writing new algorithms that involve a large number of tensor operations. Since neural networks can easily be expressed as computational graphs, they can be implemented with TensorFlow as a series of operations on tensors. Tensors, in turn, are N-dimensional arrays that represent your data.
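As a rough illustration of this "operations on tensors" idea, here is a minimal sketch using the TensorFlow 2.x eager API; the shapes and variable names are chosen purely for illustration:

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a 2-D tensor holding the data
w = tf.Variable(tf.ones((2, 1)))            # a trainable tensor (weights)

with tf.GradientTape() as tape:
    y = tf.matmul(x, w)                     # a tensor operation in the graph
    loss = tf.reduce_sum(y)

grads = tape.gradient(loss, [w])            # automatic differentiation
print(loss.numpy(), grads[0].numpy())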

Features of TensorFlow

TensorFlow is optimized for speed, and it makes use of techniques like XLA for quick linear algebra operations.

1. Responsive Construct

With TensorFlow, we can easily visualize each and every part of the computational graph, which is not an option while using NumPy or scikit-learn.

2. Flexible

One of the very important TensorFlow features is that it is flexible in its operability: it is modular, and for the parts you want to use as stand-alone components, it offers that option.

3. Easily Trainable

It is easily trainable on CPU as well as GPU for distributed computing.

4. Parallel Neural Network Training

TensorFlow offers pipelining, in the sense that you can train multiple neural networks on multiple GPUs, which makes the models very efficient on large-scale systems.

5. Large Community

Needless to say, since it is developed by Google, a large team of software engineers is already working on stability improvements continuously.

6. Open Source

The best thing about this machine learning library is that it is open source, so anyone with an internet connection can download and use it.

Where Is TensorFlow Used?

You are using TensorFlow daily but indirectly with applications like Google Voice Search or Google Photos. These applications are developed using this library.

All the libraries created in TensorFlow are written in C and C++. However, it offers a frontend for Python: your Python code gets compiled and then executed on the TensorFlow distributed execution engine, which is built using C and C++.

The number of applications of TensorFlow is literally unlimited, and that is the beauty of TensorFlow.

Scikit-Learn


What Is Scikit-learn?

Scikit-learn is a Python library associated with NumPy and SciPy. It is considered one of the best libraries for working with complex data.

There are a lot of changes being made in this library. One modification is the cross-validation feature, which provides the ability to use more than one metric. Many training methods, like logistic regression and nearest neighbors, have also received small improvements.

Features Of Scikit-Learn

1. Cross-validation: There are various methods to check the accuracy of supervised models on unseen data.

2. Unsupervised learning algorithms: Again, there is a large spread of algorithms on offer, from clustering, factor analysis, and principal component analysis to unsupervised neural networks.

3. Feature extraction: Useful for extracting features from images and text (e.g., bag of words).

Where Is Scikit-Learn Used?

It contains numerous algorithms for implementing standard machine learning and data mining tasks like dimensionality reduction, classification, regression, clustering, and model selection.
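As a small sketch of how these pieces fit together, here is one way to cross-validate a classifier with scikit-learn; the toy dataset and model choice are just for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation with an explicitly chosen metric
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())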

Numpy


What Is Numpy?

Numpy is considered one of the most popular machine learning libraries in Python.

TensorFlow and other libraries use Numpy internally for performing multiple operations on Tensors. Array interface is the best and the most important feature of Numpy.

Features Of Numpy

  1. Interactive: NumPy is very interactive and easy to use.
  2. Mathematics: Makes complex mathematical implementations very simple.
  3. Intuitive: Makes coding really easy, and the concepts are easy to grasp.
  4. Lots of interaction: Widely used, hence a lot of open-source contribution.

Where Is Numpy Used?

This interface can be utilized for expressing images, sound waves, and other binary raw streams as N-dimensional arrays of real numbers.
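A minimal sketch of that idea, treating a hypothetical small grayscale image as a NumPy array and operating on it without explicit loops:

import numpy as np

# A hypothetical 8x8 grayscale "image" as a 2-D array of floats
image = np.random.rand(8, 8)

# Vectorized operations instead of explicit Python loops
normalized = (image - image.mean()) / image.std()
print(normalized.shape, normalized.dtype)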

For implementing this library for machine learning, knowledge of NumPy is important for full-stack developers.

Keras


What Is Keras?

Keras is considered one of the coolest machine learning libraries in Python. It provides an easier mechanism to express neural networks. Keras also provides some of the best utilities for compiling models, processing data-sets, visualization of graphs, and much more.

In the backend, Keras uses either Theano or TensorFlow internally; other popular backends, such as CNTK, can also be used. Keras is comparatively slow when compared with other machine learning libraries, because it creates a computational graph using the backend infrastructure and then uses that graph to perform operations. All the models in Keras are portable.

Features Of Keras

  • It runs smoothly on both CPU and GPU.
  • Keras supports almost all the models of a neural network — fully connected, convolutional, pooling, recurrent, embedding, etc. Furthermore, these models can be combined to build more complex models.
  • Keras, being modular in nature, is incredibly expressive, flexible, and apt for innovative research.
  • Keras is a completely Python-based framework, which makes it easy to debug and explore.

Where Is Keras Used?

You are already constantly interacting with features built with Keras — it is in use at Netflix, Uber, Yelp, Instacart, Zocdoc, Square, and many others. It is especially popular among startups that place deep learning at the core of their products.

Keras contains numerous implementations of commonly used neural network building blocks such as layers, objectives, activation functions, optimizers and a host of tools to make working with image and text data easier.
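To give a sense of those building blocks, here is a minimal sketch of a Keras model using the Keras API bundled with TensorFlow (tf.keras); the layer sizes are arbitrary and chosen only for illustration:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(784,)),  # fully connected layer
    layers.Dense(10, activation="softmax"),                   # output layer
])

# Compile with an optimizer, an objective (loss), and a metric
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()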

Plus, it provides many pre-processed data-sets and pre-trained models like MNIST, VGG, Inception, SqueezeNet, ResNet, etc.

Keras is also a favorite among deep learning researchers, coming in at #2 among deep learning frameworks. Keras has also been adopted by researchers at large scientific organizations, in particular CERN and NASA.

PyTorch


What Is PyTorch?

PyTorch is a large machine learning library that allows developers to perform tensor computations with GPU acceleration, create dynamic computational graphs, and calculate gradients automatically. Beyond this, PyTorch offers rich APIs for solving application issues related to neural networks.
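A minimal sketch of those three capabilities (tensors, a dynamically built graph, and automatic gradients); the tensor shapes here are arbitrary:

import torch

x = torch.randn(3, 2)                       # a tensor (moves to the GPU with .cuda() if available)
w = torch.ones(2, 1, requires_grad=True)    # track gradients for this tensor

y = x @ w                                   # the graph is built dynamically as operations run
loss = y.sum()
loss.backward()                             # autograd computes d(loss)/d(w)

print(w.grad)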

This machine learning library is based on Torch, which is an open-source machine library implemented in C with a wrapper in Lua.

This machine learning library for Python was introduced in 2017, and since its inception it has been gaining popularity and attracting an increasing number of machine learning developers.

Features Of PyTorch

Hybrid Front-End

A new hybrid frontend provides ease-of-use and flexibility in eager mode, while seamlessly transitioning to graph mode for speed, optimization, and functionality in C++ runtime environments.

Distributed Training

Optimize performance in both research and production by taking advantage of native support for asynchronous execution of collective operations and peer-to-peer communication that is accessible from Python and C++.

Python First

PyTorch is not a Python binding into a monolithic C++ framework. It’s built to be deeply integrated into Python so it can be used with popular libraries and packages such as Cython and Numba.

Libraries and Tools

An active community of researchers and developers have built a rich ecosystem of tools and libraries for extending PyTorch and supporting development in areas from computer vision to reinforcement learning.

Where Is PyTorch Used?

PyTorch is primarily used for applications such as natural language processing.

It is primarily developed by Facebook's artificial-intelligence research group, and Uber's probabilistic programming software "Pyro" is built on it.

PyTorch is outperforming TensorFlow in multiple ways, and it has been gaining a lot of attention recently.

LightGBM


What Is LightGBM?

Gradient boosting is one of the best and most popular machine learning (ML) techniques; it helps developers build new algorithms by combining refitted elementary models, namely decision trees. Therefore, there are special libraries designed for fast and efficient implementations of this method.

These libraries are LightGBM, XGBoost, and CatBoost. All these libraries are competitors that help in solving a common problem and can be utilized in almost a similar manner.

Features of LightGBM

  • Very fast computation ensures high production efficiency.
  • Intuitive, hence user-friendly.
  • Faster training than many other deep learning libraries.
  • Handles NaN values and other canonical values without producing errors.

Where Is LightGBM Used?

This library provides highly scalable, optimized, and fast implementations of gradient boosting, which makes it popular among machine learning developers; many machine learning competitions have been won using these algorithms.
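As a brief, hedged sketch of LightGBM's scikit-learn-style interface (the dataset here is a stand-in chosen only for illustration):

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(n_estimators=100)  # gradient-boosted decision trees
model.fit(X_train, y_train)
print(model.score(X_test, y_test))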

Eli5


What Is Eli5?

It is often hard to understand why a machine learning model makes the predictions it does, and the Eli5 machine learning library, built in Python, helps overcome this challenge. It combines visualization and debugging of machine learning models and tracks the working steps of an algorithm.

Features of Eli5

Moreover, Eli5 supports other libraries such as XGBoost, lightning, scikit-learn, and sklearn-crfsuite. Each of them can be used to perform different tasks.
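A small sketch of the kind of inspection Eli5 provides, assuming its explain_weights and format_as_text helpers and an arbitrary scikit-learn model chosen only for illustration:

import eli5
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Ask Eli5 to describe the feature weights the model has learned
explanation = eli5.explain_weights(clf)
print(eli5.format_as_text(explanation))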

Where Is Eli5 Used?

  • Mathematical applications that require a lot of computation in a short time.
  • Eli5 plays a vital role where there are dependencies with other Python packages.
  • Legacy applications and implementing newer methodologies in various fields.

SciPy


What Is SciPy?

SciPy is a machine learning library for application developers and engineers. However, you should know the difference between the SciPy library and the SciPy stack. The SciPy library contains modules for optimization, linear algebra, integration, and statistics.

Features Of SciPy

The main feature of the SciPy library is that it is developed using NumPy and makes heavy use of NumPy arrays.

In addition, SciPy provides all the efficient numerical routines like optimization, numerical integration, and many others using its specific submodules.

All the functions in all submodules of SciPy are well documented.

Where Is SciPy Used?

SciPy is a library that uses NumPy to solve mathematical problems. SciPy uses NumPy arrays as its basic data structure and comes with modules for various commonly used tasks in scientific programming.

Tasks such as linear algebra, integration (calculus), ordinary differential equation solving, and signal processing are handled easily by SciPy.
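A minimal sketch of two of those submodules in action, numerical integration and optimization; the integrand and objective function are chosen only for illustration:

import numpy as np
from scipy import integrate, optimize

# Integrate sin(x) from 0 to pi (the exact answer is 2)
area, _ = integrate.quad(np.sin, 0, np.pi)

# Find the minimum of (x - 3)^2
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0.0)

print(area, result.x)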

Theano


What Is Theano?

Theano is a computational-framework machine learning library in Python for computing multidimensional arrays. Theano works similarly to TensorFlow, but it is not as efficient, largely because of its inability to fit into production environments.

Moreover, Theano can also be used in distributed or parallel environments, much like TensorFlow.

Features Of Theano

  • Tight integration with NumPy – Ability to use NumPy arrays fully in Theano-compiled functions.
  • Transparent use of a GPU – Perform data-intensive computations much faster than on a CPU.
  • Efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
  • Speed and stability optimizations – Get the right answer for log(1+x) even when x is very tiny. This is just one of the examples to show the stability of Theano.
  • Dynamic C code generation – Evaluate expressions faster than ever before, thereby increasing efficiency by a lot.
  • Extensive unit-testing and self-verification – Detect and diagnose multiple types of errors and ambiguities in the model.

Where Is Theano Used?

The actual syntax of Theano expressions is symbolic, which can be off-putting to beginners used to normal software development. Specifically, an expression is defined in the abstract sense, compiled, and later actually used to make calculations.
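That define-compile-evaluate workflow looks roughly like the following minimal sketch (the variable names are arbitrary):

import theano
import theano.tensor as T

x = T.dscalar("x")          # symbolic scalar, nothing is computed yet
y = T.dscalar("y")
z = x ** 2 + y              # a symbolic expression

f = theano.function([x, y], z)   # compile the expression into a callable
print(f(3.0, 4.0))               # only now does the calculation actually run -> 13.0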

It was specifically designed to handle the types of computation required for large neural network algorithms used in Deep Learning. It was one of the first libraries of its kind (development started in 2007) and is considered an industry standard for Deep Learning research and development.

Theano is being used in multiple neural network projects today, and the popularity of Theano is only growing with time.

Pandas


What Is Pandas?

Pandas is a machine learning library in Python that provides high-level data structures and a wide variety of tools for analysis. One of the great features of this library is the ability to express complex operations on data with one or two commands. Pandas has many inbuilt methods for grouping, combining, and filtering data, as well as time-series functionality.

All of this is backed by very good performance.

Features Of Pandas

Pandas makes the entire process of manipulating data easier. Support for operations such as re-indexing, iteration, sorting, aggregation, concatenation, and visualization is among the feature highlights of Pandas.
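A tiny sketch of that "one or two commands" style of data manipulation, grouping and aggregating a made-up table:

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Paris", "Berlin", "Berlin"],
    "sales": [120, 80, 95, 110],
})

# Group by city, aggregate, then sort the result
summary = df.groupby("city")["sales"].sum().sort_values(ascending=False)
print(summary)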

Where Is Pandas Used?

The Pandas library has relatively few releases, but each one includes hundreds of new features, bug fixes, enhancements, and API changes. The improvements in Pandas lie in its ability to group and sort data, select the best-suited output for the applied method, and support custom types of operations.

Data Analysis, among everything else, takes the highlight when it comes to using Pandas. But when used with other libraries and tools, Pandas ensures high functionality and a good amount of flexibility.

That's it, folks! I hope this article helped you kickstart your learning of the libraries available in Python.

Setting Up a Hadoop Pseudo-Distributed Environment

Hardware and Software Environment

  • CentOS 7.2 64-bit
  • OpenJDK 1.7
  • Hadoop 2.7

About This Tutorial

The cloud lab's virtual machine logs in automatically with the root account, so every operation in this tutorial is executed as the root user. If you want to follow this tutorial on your own server, it is recommended, for system security, that you create a new account and log in with it before proceeding.

Installing an SSH Client

Estimated time: 1-5 min

Install SSH

Install SSH:

sudo yum install openssh-clients openssh-server

After the installation completes, you can test it with the following command:

ssh localhost

Enter the root account's password. If you can log in normally, SSH is installed correctly. After the test succeeds, leave the SSH session with the exit command.

Installing the Java Environment

Estimated time: 5-10 min

Install the JDK

Use yum to install OpenJDK 1.7:

sudo yum install java-1.7.0-openjdk java-1.7.0-openjdk-devel

After installation, run the java and javac commands. If they print their usage help, the JDK is installed correctly.

Configure the Java Environment Variables

Edit ~/.bashrc and append the following at the end:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk

After saving the file, run the following command so that the JAVA_HOME environment variable takes effect:

source ~/.bashrc

To check whether the Java environment is configured correctly and has taken effect, run the following commands:

java -version
$JAVA_HOME/bin/java -version

If both commands print the same result, matching the OpenJDK 1.7.0 version installed above, the JDK is installed and configured correctly.

Installing Hadoop

Estimated time: 10-15 min

Download Hadoop

This tutorial uses Hadoop 2.7, downloaded with wget. (Note: this tutorial downloads from the Tsinghua University mirror; if the download fails or reports an error, pick any other mirror that hosts Hadoop 2.7.)

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz

Install Hadoop

Install Hadoop into the /usr/local directory:

tar -zxf hadoop-2.7.4.tar.gz -C /usr/local

Rename the installation directory to make later operations easier:

cd /usr/local
mv ./hadoop-2.7.4/ ./hadoop

Check whether Hadoop has been installed correctly:

/usr/local/hadoop/bin/hadoop version

If the Hadoop version information is printed, Hadoop has been installed successfully.

Configuring the Hadoop Pseudo-Distributed Environment

Estimated time: 15-30 min

In pseudo-distributed mode, Hadoop uses multiple daemon processes on one machine to simulate a distributed deployment.

Set the Hadoop Environment Variables

Edit ~/.bashrc and append the following at the end:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Make the Hadoop environment variables take effect:

source ~/.bashrc

Modify the Hadoop Configuration Files

Hadoop's configuration files live under etc/hadoop inside the installation directory, i.e. /usr/local/hadoop/etc/hadoop in this tutorial. The two files that need to be modified are:

/usr/local/hadoop/etc/hadoop/core-site.xml
/usr/local/hadoop/etc/hadoop/hdfs-site.xml

Edit core-site.xml and change the contents of the <configuration></configuration> node to the following:

Example: /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>location to store temporary files</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Similarly, edit hdfs-site.xml and change the contents of the <configuration></configuration> node to the following:

Example: /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

Format the NameNode

Format the NameNode:

/usr/local/hadoop/bin/hdfs namenode -format

If the output contains the following lines, the format succeeded:

Storage directory /usr/local/hadoop/tmp/dfs/name has been successfully formatted.
Exiting with status 0

Start the NameNode and DataNode Daemons

Start the NameNode and DataNode processes:

/usr/local/hadoop/sbin/start-dfs.sh

You will be prompted for a password during startup; enter the root user's password. Also, ssh may warn and ask whether to continue connecting; type yes.

Check whether the NameNode and DataNode have started correctly:

jps

If the NameNode and DataNode are running, the process information for NameNode, DataNode, and SecondaryNameNode will be shown:

[hadoop@VM_80_152_centos ~]$ jps
3689 SecondaryNameNode
3520 DataNode
3800 Jps
3393 NameNode

Running a Hadoop Pseudo-Distributed Example

Estimated time: 10-20 min

Hadoop ships with a rich set of examples, including wordcount, grep, sort, and so on. Below we use the grep example: given a set of input files, it extracts the words matching the regular expression dfs[a-z.]+ and counts how often they occur.

View the Examples Bundled with Hadoop

Hadoop comes with many examples; run the following command to list them:

cd /usr/local/hadoop
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar

Create a User Directory in HDFS

Create the user directory hadoop in HDFS:

/usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop

Prepare the Experiment Data

In this tutorial, we use all of Hadoop's XML configuration files as the input data. Run the following commands to create an input folder in HDFS and upload the Hadoop configuration files into it:

cd /usr/local/hadoop
./bin/hdfs dfs -mkdir /user/hadoop/input
./bin/hdfs dfs -put ./etc/hadoop/*.xml /user/hadoop/input

You can list the files just uploaded to HDFS with the following command:

/usr/local/hadoop/bin/hdfs dfs -ls /user/hadoop/input

Run the Experiment

Run the experiment:

cd /usr/local/hadoop
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar grep /user/hadoop/input /user/hadoop/output 'dfs[a-z.]+'

The command above runs Hadoop's bundled grep program with the HDFS input directory as its input, extracts the entries matching the regular expression dfs[a-z.]+, counts their occurrences, and writes the results to the output folder on HDFS.

View the Results

The results of the example above are stored in HDFS; view them with the following command:

/usr/local/hadoop/bin/hdfs dfs -cat /user/hadoop/output/*

If the run succeeded, you should see results like the following:

1       dfsadmin
1       dfs.replication
1       dfs.namenode.name.dir
1       dfs.datanode.data.dir

Delete the Output on HDFS

Delete the result directory in HDFS:

/usr/local/hadoop/bin/hdfs dfs -rm -r /user/hadoop/output

To keep results from being overwritten, the output directory specified by a Hadoop program must not already exist when the program runs, otherwise an error is reported; so delete the output directory before the next run.

Stop the Hadoop Processes

Stop the Hadoop processes:

/usr/local/hadoop/sbin/stop-dfs.sh

To start them again later, simply run:

/usr/local/hadoop/sbin/start-dfs.sh

Deployment Complete

Estimated time: n/a

Done!

Congratulations, you have finished setting up a Hadoop pseudo-distributed environment.

An Analysis of Facebook's HBase Optimizations

The point of using HBase is random reads and writes over massive amounts of data, but in practice random-read performance and GC turn out to be big problems. In addition, HBase's data is stored on HDFS, and HDFS is designed for streaming data access, which inevitably costs some efficiency. Below is a case study of the Facebook Messages (FM) system using HBase as online storage ("Apache Hadoop Goes Realtime at Facebook", SIGMOD 2011). More recently they published "Analysis of HDFS Under HBase: A Facebook Messages Case Study" at FAST 2014, a top storage conference, analyzing the problems they ran into while using HBase and their solutions. The paper first describes Facebook's methodology (tracing/analysis/simulation) and the FM system's architecture, files, and data composition, then analyzes the FM system's performance problems and proposes solutions.


The FM System's Main Read/Write I/O Load

Figure 2 breaks down the I/O at each layer. It shows that reads dominate the FM system's external requests, but that logging, compaction, replication, and caching severely amplify writes.

  • HBase is designed in layers: the DB logical layer, the FS logical layer, and the underlying system layer. The DB logical layer's external interface consists mainly of put() and get() requests, both of which end up writing to HDFS; at this level the read/write ratio is 99/1 (first row of Figure 2).
  • Internally, the DB logical layer does logging to guarantee durability and compaction to keep reads efficient, and both operations are write-dominated, so once these overheads are added the read/write ratio becomes 79/21 (second row of Figure 2).
  • In effect, data written to HBase via put() is written twice: once into the in-memory Memstore and later flushed to an HFile on HDFS, and once directly to the HLog on HDFS via logging. The Memstore accumulates a fair amount of data before an HFile is written, so its compression ratio is relatively high, whereas the HLog must append records in real time, so its compression ratio (HBASE-8155) is relatively low; as a result, writes are amplified by more than 4x. Compaction reads small HFiles into memory, merge-sorts them into a large HFile, and writes it out, which speeds up HBase reads. Compaction amplifies writes by more than 17x, meaning each piece of data is re-read and re-written 17 times on average, so large, immutable attachments are a poor fit for HBase. Because reads dominate the FM workload, speeding up reads helps the business a lot, and so the compaction policy is fairly aggressive.
    HBase's data reliability is guaranteed by the HDFS layer, i.e. HDFS's three-way replication, so each HDFS write above turns into 3x local file I/O and 2x network I/O. Measured at local disk I/O, the read/write ratio becomes 55/45.
  • However, reads against the local disk are served from the local OS cache, so the real read I/O is only what cache misses generate; this pushes the read/write ratio to 36/64, amplifying writes even further. Figure 3 then looks at the I/O at each layer and for each operation relative to the data size the business actually needs. Besides the above, it also shows that a large share of the data that ultimately sits on disk is cold (about 2/3), so hot and cold data should be stored separately.

In short, the logging, compaction, replication, and caching in the HBase stack amplify write I/O, so a system whose business logic is read-dominated ends up write-dominated at the level of actual disk I/O.

The FM System's Main File Types and Sizes


Table 2 lists the FM system's file types; this is a purely business-level description. On each HBase RegionServer, each column family corresponds to one or more HFiles. The FM system has 8 column families, and because each column family stores data of different types and sizes, their read/write ratios differ. Very little data is both read and written, so caching all writes may not help much (Figure 4).

For each column family, 90% of the files are smaller than 15 MB, but a small number of very large files pull the average file size up; for example, the MessageMeta column family has an average file size of 293 MB. Looking at file lifetimes, most of FM's data lives in large, long-lived files, yet most files are small and short-lived. That is a big challenge for the HDFS NameNode, because HDFS was designed to store a small number of large files, all file metadata is kept in the NameNode's memory, and there is NameNode federation to consider.

The FM System's Main I/O Access Patterns

Now consider temporal locality, spatial locality, and sequentiality.
73.7% of the data is read only once, but 1.1% of the data is read at least 64 times; that is, only a small portion of the data is read repeatedly. From the standpoint of the I/O actually issued, however, only 19% of read operations fetch data that is read just once, while most read I/O goes to that hot data.
At the HDFS layer, FM reads show no sequentiality, which means that high-bandwidth, high-latency mechanical disks are not the ideal medium for serving reads. The reads also show no spatial locality, so read-ahead does not help much either.
Solution 1: Using Flash/SSD as a Cache


Now consider what architecture could speed this system up. Each node in Facebook's HBase deployment currently has 15 disks with 100 MB/s bandwidth and 10 ms seek time. Figure 9 shows that: a) adding more disks helps a little; b) adding disk bandwidth barely helps; c) reducing seek time helps a lot.
Because a small portion of the same data is read over and over, a large cache can absorb roughly 80% of reads without touching the disk, and only that small amount of hot data needs to be cached. What storage medium should the cache use? Figure 11 shows that with a sufficiently large Flash device as a second-level cache, the cache hit rate improves markedly, while the hit rate depends very little on the amount of RAM.
Note: for using Flash/SSD as a cache, see HBase BucketBlockCache (HBASE-7404).
A common concern is Flash/SSD lifetime. Shuffling data between RAM and Flash keeps the hottest data in RAM and improves read performance, but it shortens the Flash's lifetime; as the technology improves, the impact of this issue should keep shrinking.
Having covered the read cache, the paper then discusses whether Flash as a write buffer improves performance. Because an HDFS write is considered durable once the data has been received into the DataNodes' memory (with three DataNodes storing it simultaneously, it is assumed that the memory-to-disk flush will not fail on all three), using Flash as a write buffer does not improve performance. A write buffer would reduce the I/O contention between background compaction and the foreground service, but it adds a lot of complexity, so they decided against it. Their conclusion: Flash as a write buffer is not useful.
They also calculate that adding Flash as a second-level cache to this storage stack improves performance by as much as 3x while adding only 5% to the cost, a much better price/performance ratio than adding RAM.
2. Drawbacks of the Layered Architecture and Proposed Improvements

As Figure 16 shows, a distributed database system is generally split into three layers: the DB layer, the replication layer, and the local layer. The biggest advantage of this layering is that it is clean and simple, with each layer doing its own job; for example, the DB layer only handles DB-related logic and treats the underlying storage as available and reliable.
HBase uses architecture a) in the figure, where data replication is handled by HDFS. The problem this causes is that an operation such as compaction reads several triple-replicated small files into memory and merge-sorts them into a triple-replicated large file, and this can only be done on one of the RS/DN nodes, so reading and writing the data held on the other RS/DNs incurs network I/O.
Architecture b) in the figure moves the replication layer above the DB layer; the example Facebook gives is Salus, which I am not very familiar with. I believe Cassandra follows this architecture. Its drawback is that the DB layer has to deal with the underlying file system's problems while also staying coordinated with the DB layers on other nodes, which is too complex.
Architecture c) in the figure is an improvement on a); Spark uses this architecture. HBase's compaction operation could then be simplified into two RDD transformations, a join and a sort.

Figure 17 shows how local compaction works: half of the original network I/O is turned into local disk reads, which can additionally be accelerated by the read cache. In data-intensive computing systems, the network switch is a major I/O bottleneck; for example, the data shuffle in a MapReduce job is the most time-consuming step and demands a lot of network bandwidth. UC San Diego (UCSD) and Microsoft Research Asia (MSRA) have both designed dedicated data-center network topologies to optimize the network I/O load, publishing several papers at SIGCOMM, the top networking conference, but because those designs required invasive changes to network routers, none of them were widely adopted.
Figure 19 shows how combined logging works. Today, multiple HBase RegionServers send write-log requests to the same DataNode, and the DataNode writes the logs from these RegionServers into different files/blocks, which causes a lot of disk seeking on that DataNode (random I/O instead of sequential I/O). Combined logging writes the logs from different RegionServers into the same file, turning the DataNode's random I/O into sequential I/O.

Building TensorFlow for Raspberry Pi: a Step-By-Step Guide

What You Need

  • Raspberry Pi 2 or 3 Model B
  • An SD card running Raspbian with several GB of free space
    • An 8 GB card with a fresh install of Raspbian does not have enough space. A 16 GB SD card minimum is recommended.
    • These instructions may work on Linux distributions other than Raspbian
  • Internet connection to the Raspberry Pi
  • A USB memory drive that can be installed as swap memory (if it is a flash drive, make sure you don’t care about the drive). Anything over 1 GB should be fine
  • A fair amount of time

Overview

These instructions were crafted for a Raspberry Pi 3 Model B running a vanilla copy of Raspbian 8.0 (jessie). It appears to work on Raspberry Pi 2, but there are some kinks that are being worked out. If these instructions work for different distributions, let me know!

Here’s the basic plan: build a 32-bit version of Protobuf, use that to build a RPi-friendly version of Bazel, and finally use Bazel to build TensorFlow.

The Build

1. Install basic dependencies

First, update apt-get to make sure it knows where to download everything.

sudo apt-get update

Next, install some base dependencies and tools we’ll need later.

For Protobuf:

sudo apt-get install autoconf automake libtool maven

For gRPC:

sudo apt-get install oracle-java7-jdk
# Select the jdk-7-oracle option for the update-alternatives command
sudo update-alternatives --config java

For Bazel:

sudo apt-get install pkg-config zip g++ zlib1g-dev unzip

For TensorFlow:

# For Python 2.7
sudo apt-get install python-pip python-numpy swig python-dev
sudo pip install wheel

# For Python 3.3+
sudo apt-get install python3-pip python3-numpy swig python3-dev
sudo pip3 install wheel

To be able to take advantage of certain optimization flags:

sudo apt-get install gcc-4.8 g++-4.8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.8 100

Finally, for cleanliness, make a directory that will hold the Protobuf, Bazel, and TensorFlow repositories.

mkdir tf
cd tf

2. Build Protobuf

Clone the Protobuf repository.

git clone https://github.com/google/protobuf.git

Now move into the new protobuf directory, configure it, and make it. Note: this takes a little while.

cd protobuf
git checkout v3.0.0-beta-3.3
./autogen.sh
./configure --prefix=/usr
make -j 4
sudo make install

Once it’s made, we can move into the java directory and use Maven to build the project.

cd java
mvn package

After following these steps, you’ll have two spiffy new files: /usr/bin/protoc and protobuf/java/core/target/protobuf-java-3.0.0-beta-3.jar

3. Build gRPC

Next, we need to build gRPC-Java, the Java implementation of gRPC. Move out of the protobuf/java directory and clone gRPC’s repository.

cd ../..
git clone https://github.com/grpc/grpc-java.git
cd grpc-java
git checkout v0.14.1
cd compiler
nano build.gradle

Around line 47:

gcc(Gcc) {
    target("linux_arm-v7") {
        cppCompiler.executable = "/usr/bin/gcc"
    }
}

Around line 60, add section for 'linux_arm-v7':

...
    x86_64 {
        architecture "x86_64"
    }
    'linux_arm-v7' {
        architecture "arm32"
        operatingSystem "linux"
    }

Around line 64, add 'arm32' to list of architectures:

...
components {
    java_plugin(NativeExecutableSpec) {
            if (arch in ['x86_32', 'x86_64', 'arm32'])
...

Around line 148, replace content inside of protoc section to hard code path to protoc binary:

protoc {
    path = '/usr/bin/protoc'
}

Once all of that is taken care of, run this command to build gRPC:

../gradlew java_pluginExecutable

4. Build Bazel

First, move out of the grpc-java/compiler directory and clone Bazel’s repository.

cd ../..
git clone https://github.com/bazelbuild/bazel.git

Next, go into the new bazel directory and immediately check out version 0.3.2 of Bazel.

cd bazel
git checkout 0.3.2

After that, copy the generated Protobuf and gRPC files we created earlier into the Bazel project. Note the naming of the files in this step: it must be precise.

sudo cp /usr/bin/protoc third_party/protobuf/protoc-linux-arm32.exe
sudo cp ../protobuf/java/core/target/protobuf-java-3.0.0-beta-3.jar third_party/protobuf/protobuf-java-3.0.0-beta-1.jar
sudo cp ../grpc-java/compiler/build/exe/java_plugin/protoc-gen-grpc-java third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-x86_32.exe

Before building Bazel, we need to set the javac maximum heap size for this job, or else we’ll get an OutOfMemoryError. To do this, we need to make a small addition to bazel/scripts/bootstrap/compile.sh. (Shout-out to @SangManLINUX for pointing this out.)

nano scripts/bootstrap/compile.sh

Around line 46, you’ll find this code:

if [ "${MACHINE_IS_64BIT}" = 'yes' ]; then
    PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-x86_64.exe}
    GRPC_JAVA_PLUGIN=${GRPC_JAVA_PLUGIN:-third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-x86_64.exe}
else
    if [ "${MACHINE_IS_ARM}" = 'yes' ]; then
        PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-arm32.exe}
    else
        PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-x86_32.exe}
        GRPC_JAVA_PLUGIN=${GRPC_JAVA_PLUGIN:-third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-x86_32.exe}
    fi
fi

Change it to the following:

if [ "${MACHINE_IS_64BIT}" = 'yes' ]; then
    PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-x86_64.exe}
    GRPC_JAVA_PLUGIN=${GRPC_JAVA_PLUGIN:-third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-x86_64.exe}
else
    PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-arm32.exe}
    GRPC_JAVA_PLUGIN=${GRPC_JAVA_PLUGIN:-third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-x86_32.exe}
fi

Move down to line 149, where you’ll see the following block of code:

run "${JAVAC}" -classpath "${classpath}" -sourcepath "${sourcepath}" \
      -d "${output}/classes" -source "$JAVA_VERSION" -target "$JAVA_VERSION" \
      -encoding UTF-8 "@${paramfile}"

At the end of this block, add in the -J-Xmx500M flag, which sets the maximum size of the Java heap to 500 MB:

run "${JAVAC}" -classpath "${classpath}" -sourcepath "${sourcepath}" \
      -d "${output}/classes" -source "$JAVA_VERSION" -target "$JAVA_VERSION" \
      -encoding UTF-8 "@${paramfile}" -J-Xmx500M

Next up, we need to adjust third_party/protobuf/BUILD – open it up in your text editor.

nano third_party/protobuf/BUILD

We need to add this last line around line 29:

...
    "//third_party:freebsd": ["protoc-linux-x86_32.exe"],
    "//third_party:arm": ["protoc-linux-arm32.exe"],
}),
...

Finally, we have to add one thing to tools/cpp/cc_configure.bzl – open it up for editing:

nano tools/cpp/cc_configure.bzl

And place this in around line 141 (at the beginning of the _get_cpu_value function):

...
"""Compute the cpu_value based on the OS name."""
return "arm"
...

Now we can build Bazel! Note: this also takes some time.

sudo ./compile.sh

When the build finishes, you end up with a new binary, output/bazel. Copy that to your /usr/local/bin directory.

sudo mkdir -p /usr/local/bin
sudo cp output/bazel /usr/local/bin/bazel

To make sure it’s working properly, run bazel on the command line and verify it prints help text. Note: this may take 15-30 seconds to run, so be patient!

$ bazel

Usage: bazel <command> <options> ...

Available commands:
  analyze-profile     Analyzes build profile data.
  build               Builds the specified targets.
  canonicalize-flags  Canonicalizes a list of bazel options.
  clean               Removes output files and optionally stops the server.
  dump                Dumps the internal state of the bazel server process.
  fetch               Fetches external repositories that are prerequisites to the targets.
  help                Prints help for commands, or the index.
  info                Displays runtime info about the bazel server.
  mobile-install      Installs targets to mobile devices.
  query               Executes a dependency graph query.
  run                 Runs the specified target.
  shutdown            Stops the bazel server.
  test                Builds and runs the specified test targets.
  version             Prints version information for bazel.

Getting more help:
  bazel help <command>
                   Prints help and options for <command>.
  bazel help startup_options
                   Options for the JVM hosting bazel.
  bazel help target-syntax
                   Explains the syntax for specifying targets.
  bazel help info-keys
                   Displays a list of keys used by the info command.

Move out of the bazel directory, and we’ll move onto the next step.

cd ..

5. Install a Memory Drive as Swap for Compiling

In order to successfully build TensorFlow, your Raspberry Pi needs a little bit more memory to fall back on. Fortunately, this process is pretty straightforward. Grab a USB storage drive that has at least 1 GB of memory. I used a flash drive I could live without that carried no important data. That said, we’re only going to be using the drive as swap while we compile, so this process shouldn’t do too much damage to a relatively new USB drive.

First, insert your USB drive and find the /dev/XXX path for the device.

sudo blkid

As an example, my drive’s path was /dev/sda1

Once you’ve found your device, unmount it by using the umount command.

sudo umount /dev/XXX

Then format your device to be swap:

sudo mkswap /dev/XXX

If the previous command outputted an alphanumeric UUID, copy that now. Otherwise, find the UUID by running blkid again. Copy the UUID associated with /dev/XXX

sudo blkid

Now edit your /etc/fstab file to register your swap file. (I’m a Vim guy, but Nano is installed by default)

sudo nano /etc/fstab

On a separate line, enter the following information. Replace the X’s with the UUID (without quotes)

UUID=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX none swap sw,pri=5 0 0

Save /etc/fstab, exit your text editor, and run the following command:

sudo swapon -a

If you get an error claiming it can’t find your UUID, go back and edit /etc/fstab. Replace the UUID=XXX.. bit with the original /dev/XXX information.

sudo nano /etc/fstab
# Replace the UUID with /dev/XXX
/dev/XXX none swap sw,pri=5 0 0

Alright! You’ve got swap! Don’t throw out the /dev/XXX information yet- you’ll need it to remove the device safely later on.

6. Compiling TensorFlow

First things first, clone the TensorFlow repository and move into the newly created directory.

git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow

Note: if you’re looking to build to a specific version or commit of TensorFlow (as opposed to the HEAD at master), you should git checkout it now.

Once in the directory, we have to write a nifty one-liner that is incredibly important. The next line goes through all files and changes references of 64-bit program implementations (which we don’t have access to) to 32-bit implementations. Neat!

grep -Rl 'lib64' | xargs sed -i 's/lib64/lib/g'

Next, we need to delete a particular line in tensorflow/core/platform/platform.h. Open up the file in your favorite text editor:

$ sudo nano tensorflow/core/platform/platform.h

Now, scroll down toward the bottom and delete the following line containing #define IS_MOBILE_PLATFORM:

#elif defined(__arm__)
#define PLATFORM_POSIX
...
#define IS_MOBILE_PLATFORM   <----- DELETE THIS LINE

This keeps our Raspberry Pi device (which has an ARM CPU) from being recognized as a mobile device.

Now let’s configure Bazel:

$ ./configure

Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] N
Do you wish to build TensorFlow with GPU support? [y/N] N

Note: if you want to build for Python 3, specify /usr/bin/python3 for Python’s location.

Now we can use it to build TensorFlow! Warning: This takes a really, really long time. Several hours.

bazel build -c opt --copt="-mfpu=neon-vfpv4" --copt="-funsafe-math-optimizations" --copt="-ftree-vectorize" --local_resources 1024,1.0,1.0 --verbose_failures tensorflow/tools/pip_package:build_pip_package

Note: I toyed around with telling Bazel to use all four cores in the Raspberry Pi, but that seemed to make compiling more prone to completely locking up. This process takes a long time regardless, so I’m sticking with the more reliable options here. If you want to be bold, try using --local_resources 1024,2.0,1.0 or --local_resources 1024,4.0,1.0

When you wake up the next morning and it’s finished compiling, you’re in the home stretch! Use the built binary file to create a Python wheel.

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

And then install it!

sudo pip install /tmp/tensorflow_pkg/tensorflow-0.10-cp27-none-linux_armv7l.whl
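Before moving on, it's worth a quick sanity check that the freshly built wheel imports and runs. This is a minimal sketch using the graph/session API of that TensorFlow generation; the exact version string depends on the wheel you built:

import tensorflow as tf

print(tf.__version__)                 # should report the version you just built

hello = tf.constant("Hello from the Raspberry Pi")
with tf.Session() as sess:            # TensorFlow 0.x graph-session API
    print(sess.run(hello))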

7. Cleaning Up

There’s one last bit of house-cleaning we need to do before we’re done: remove the USB drive that we’ve been using as swap.

First, turn off your drive as swap:

sudo swapoff /dev/XXX

Finally, remove the line you wrote in /etc/fstab referencing the device

sudo nano /etc/fstab

Then reboot your Raspberry Pi.

And you’re done! You deserve a break.

Hadoop Quick Start

Purpose

The purpose of this document is to help you quickly install and use Hadoop on a single machine, so that you can get a feel for the Hadoop Distributed File System (HDFS) and the Map-Reduce framework, for example by running example programs or simple jobs on HDFS.

Prerequisites

Supported Platforms

  • GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
  • Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

Required Software

Required software for both Linux and Windows:

  1. Java 1.5.x, must be installed; Sun's Java distribution is recommended.
  2. ssh must be installed and sshd must be kept running, so that the Hadoop scripts can manage remote Hadoop daemons.

Additional requirements for Windows

  1. Cygwin – provides shell support in addition to the software listed above.

 

Installing the Software

If your cluster does not yet have the required software, install it first.

For example, on Ubuntu Linux:

$ sudo apt-get install ssh
$ sudo apt-get install rsync

On Windows, if the required software was not installed when Cygwin was set up, start the Cygwin installer and install the following package:

  • openssh – in the Net category

 

Download

To get a Hadoop distribution, download a recent stable release from one of the Apache download mirrors.

Preparing to Run the Hadoop Cluster

Unpack the downloaded Hadoop distribution. Edit the file conf/hadoop-env.sh and, at a minimum, set JAVA_HOME to the root of your Java installation.

Try the following command:
$ bin/hadoop
This will display the usage documentation for the hadoop script.

Now you can start your Hadoop cluster in one of three supported modes:

  • Local (standalone) mode
  • Pseudo-distributed mode
  • Fully-distributed mode

 

Standalone Operation

By default, Hadoop is configured to run in non-distributed mode as a single Java process. This is useful for debugging.

The following example copies the unpacked conf directory to use as input, then finds and displays every match of the given regular expression. Output is written to the specified output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*

 

Pseudo-Distributed Operation

Hadoop can also be run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.

Configuration

Use the following conf/hadoop-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

 

Passphraseless ssh Setup

Now check whether you can ssh to localhost without a passphrase:
$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Execution

Format a new distributed filesystem:
$ bin/hadoop namenode -format

Start the Hadoop daemons:
$ bin/start-all.sh

The Hadoop daemon logs are written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

Browse the web interfaces for the NameNode and the JobTracker at their default addresses.

Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input

Run one of the example programs provided with the distribution:
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*

or

View the output files directly on the distributed filesystem:
$ bin/hadoop fs -cat output/*

When you are done, stop the daemons with:
$ bin/stop-all.sh

 

Fully-Distributed Operation

Information on setting up fully-distributed, non-trivial clusters can be found in the cluster setup documentation.

HFile Storage Format

All HBase data files are stored on the Hadoop HDFS file system. There are two main file types:

1. HFile, the storage format for KeyValue data in HBase. An HFile is a Hadoop binary-format file; a StoreFile is in fact a lightweight wrapper around an HFile, i.e. a StoreFile is backed by an HFile.

2. HLog File, the storage format of HBase's WAL (Write Ahead Log); physically it is a Hadoop SequenceFile.

Below we walk through the HFile storage format mainly by way of the code.

HFile

The HFile storage format is laid out as follows.

An HFile consists of 6 parts. The KeyValue data is stored in Block 0 ... N; the remaining parts serve to locate the start of the Block Index, to locate the block that contains a given key (the block index), and to decide whether a key exists in this HFile at all (the Meta Block holds the Bloom filter information, for example). The concrete implementation is in HFile.java, and the HFile content is written top to bottom in the order Data Blocks, Meta Blocks, File Info, Data Block Index, Meta Block Index, Fixed File Trailer.

KeyValue: each KeyValue pair inside an HFile is just a byte array, but that byte array contains many items and has a fixed structure. Let's look at the layout:

It starts with two fixed-length numbers, the key length and the value length. Then comes the key: a fixed-length number giving the row key length, followed by the row key, then a fixed-length number for the family length, followed by the family, then the qualifier, and finally two fixed-length numbers for the timestamp and the key type (Put/Delete). The value part has no such internal structure; it is just raw binary data.
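Purely as an illustration (this is not HBase code), here is a rough Python sketch of how a byte array with that layout could be decoded; it assumes 4-byte lengths, a 2-byte row key length, a 1-byte family length, an 8-byte timestamp, and a 1-byte key type, mirroring the description above:

import struct

def parse_keyvalue(buf):
    """Decode one HFile-style KeyValue from a byte buffer (illustrative sketch)."""
    key_len, val_len = struct.unpack_from(">ii", buf, 0)      # two fixed-length lengths
    key_start = 8
    row_len, = struct.unpack_from(">h", buf, key_start)       # row key length
    pos = key_start + 2
    row = buf[pos:pos + row_len]; pos += row_len
    fam_len = buf[pos]; pos += 1                               # family length
    family = buf[pos:pos + fam_len]; pos += fam_len
    # The qualifier fills the rest of the key, minus timestamp (8) and key type (1)
    qual_end = key_start + key_len - 9
    qualifier = buf[pos:qual_end]
    timestamp, = struct.unpack_from(">q", buf, qual_end)
    key_type = buf[qual_end + 8]
    value = buf[key_start + key_len:key_start + key_len + val_len]
    return row, family, qualifier, timestamp, key_type, value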

Data Block: consists of DATABLOCKMAGIC and a number of records, where each record is one KeyValue (key length, value length, key, value). The default block size is 64 KB. Small data blocks favor random reads, while large data blocks favor scans, because when a KeyValue is read, HBase loads the whole data block it was found in into the LRU block cache, not just that one record.

private void append(final byte[] key, final int koffset, final int klength,
    final byte[] value, final int voffset, final int vlength) throws IOException {
  this.out.writeInt(klength);
  this.keylength += klength;
  this.out.writeInt(vlength);
  this.valuelength += vlength;
  this.out.write(key, koffset, klength);
  this.out.write(value, voffset, vlength);
}

Meta Block: consists of METABLOCKMAGIC and the Bloom filter information.

public void close() throws IOException {
  if (metaNames.size() > 0) {
    for (int i = 0; i < metaNames.size(); ++i) {
      dos.write(METABLOCKMAGIC);
      metaData.get(i).write(dos);
    }
  }
}

File Info: consists of a MapSize and a number of key/value pairs. It stores basic information about the HFile, such as hfile.LASTKEY, hfile.AVG_KEY_LEN, hfile.AVG_VALUE_LEN, and hfile.COMPARATOR.

private long writeFileInfo(FSDataOutputStream o) throws IOException {
  if (this.lastKeyBuffer != null) {
    // Make a copy. The copy is stuffed into HMapWritable. Needs a clean
    // byte buffer. Won't take a tuple.
    byte[] b = new byte[this.lastKeyLength];
    System.arraycopy(this.lastKeyBuffer, this.lastKeyOffset, b, 0, this.lastKeyLength);
    appendFileInfo(this.fileinfo, FileInfo.LASTKEY, b, false);
  }
  int avgKeyLen = this.entryCount == 0 ? 0 : (int) (this.keylength / this.entryCount);
  appendFileInfo(this.fileinfo, FileInfo.AVG_KEY_LEN, Bytes.toBytes(avgKeyLen), false);
  int avgValueLen = this.entryCount == 0 ? 0 : (int) (this.valuelength / this.entryCount);
  appendFileInfo(this.fileinfo, FileInfo.AVG_VALUE_LEN,
      Bytes.toBytes(avgValueLen), false);
  appendFileInfo(this.fileinfo, FileInfo.COMPARATOR,
      Bytes.toBytes(this.comparator.getClass().getName()), false);
  long pos = o.getPos();
  this.fileinfo.write(o);
  return pos;
}

Data/Meta Block Index: consists of INDEXBLOCKMAGIC and a number of records, where each record has 3 parts: the block's starting offset, the block's size, and the first key in the block.

static long writeIndex(final FSDataOutputStream o, final List<byte[]> keys,
    final List<Long> offsets, final List<Integer> sizes) throws IOException {
  long pos = o.getPos();
  // Don't write an index if nothing in the index.
  if (keys.size() > 0) {
    o.write(INDEXBLOCKMAGIC);
    // Write the index.
    for (int i = 0; i < keys.size(); ++i) {
      o.writeLong(offsets.get(i).longValue());
      o.writeInt(sizes.get(i).intValue());
      byte[] key = keys.get(i);
      Bytes.writeByteArray(o, key);
    }
  }
  return pos;
}

Fixed File Trailer: has a fixed size and is mainly used to find the starting positions of the File Info and Block Index sections.

public void close() throws IOException {
  trailer.fileinfoOffset = writeFileInfo(this.outputStream);
  trailer.dataIndexOffset = BlockIndex.writeIndex(this.outputStream,
      this.blockKeys, this.blockOffsets, this.blockDataSizes);
  if (metaNames.size() > 0) {
    trailer.metaIndexOffset = BlockIndex.writeIndex(this.outputStream,
        this.metaNames, metaOffsets, metaDataSizes);
  }
  trailer.dataIndexCount = blockKeys.size();
  trailer.metaIndexCount = metaNames.size();
  trailer.totalUncompressedBytes = totalBytes;
  trailer.entryCount = entryCount;
  trailer.compressionCodec = this.compressAlgo.ordinal();
  trailer.serialize(outputStream);
}

Note: the code above is excerpted from HFile.java; see the HBase source code for more details.

References: http://www.searchtb.com/2011/01/understanding-hbase.html

http://th30z.blogspot.com/2011/02/hbase-io-hfile.html

 

A Survey of the Ceph Distributed File System, Part 1: RADOS

Ceph is a new-generation free-software distributed file system designed by Sage Weil (co-founder of DreamHost) at UC Santa Cruz for his PhD thesis. After graduating in 2007, Sage began working on Ceph full time to make it suitable for production environments. Ceph's main goal is to be a POSIX-based distributed file system with no single point of failure, in which data is replicated fault-tolerantly and seamlessly. In March 2010, Linus Torvalds merged the Ceph client into kernel 2.6.34.

Ceph contains many techniques that are quite novel in the distributed-systems field and is very instructive for studying common problems in distributed file systems, so it is well worth researching.

Introduction to RADOS

1 RADOS Overview

RADOS (Reliable, Autonomic Distributed Object Store) is one of the cores of Ceph. As a subproject of the Ceph distributed file system, designed specifically for Ceph's needs, it provides a stable, scalable, high-performance single logical object store interface on top of a dynamically changing, heterogeneous cluster of storage devices, with nodes that are self-adapting and self-managing. In fact, RADOS can also be used on its own as a distributed data store, providing data storage services to any distributed file system with matching requirements.

2 RADOS Architecture

The RADOS system consists of two main parts (see Figure 1):

1. A cluster of a variable, potentially large number of OSDs (Object Storage Devices), responsible for storing all object data;

2. A small, tightly coupled cluster of a few Monitors, responsible for managing the Cluster Map. The Cluster Map is the key data structure of the whole RADOS system; it records all members, relationships, and attributes of the cluster as well as how data is distributed.

Figure 1: RADOS system architecture

In RADOS, node management and data placement are handled entirely by the internal Monitors, so the client-side design is relatively simple: it exposes only a simple storage interface to applications.

3 RADOS in Detail

3.1 Scaling the Cluster

1. Cluster Map

The only way to manage the storage cluster is to operate on the Cluster Map through the Monitor cluster. The Cluster Map is the core data structure of the RADOS system; it specifies the OSDs in the cluster and how all data is distributed. Every storage node and client involved with RADOS holds a copy of the latest-epoch Cluster Map. Thanks to this, clients are given a very simple interface that abstracts the entire storage cluster into a single logical object store.

Cluster Map updates are driven by OSD state changes or other events that change the data layer. Every Cluster Map update increments the map epoch. The map epoch keeps the replicas of the Cluster Map on all nodes in sync, and it also allows stale Cluster Maps to be brought up to date promptly through communication with peers.

In a large-scale distributed system, OSD failures and recoveries are common, so Cluster Map updates are frequent. Distributing or broadcasting the entire Cluster Map each time would clearly waste resources, so RADOS distributes incremental maps instead, where an incremental map contains only the delta between two consecutive epochs of the Cluster Map.

2. Data Placement

Data migration: when a new storage device joins, a randomly chosen portion of the data in the cluster is migrated to the new device to keep the existing storage structure balanced.

Data placement: the storage location of an object is computed in two stages, as shown in Figure 2 and in the sketch after the list below.

Figure 2: data placement

1. Object-to-PG mapping. A PG (Placement Group) is a logical collection of objects; objects in the same PG are placed by the system on the same set of OSDs. The PG for an object is obtained by hashing the object's name and combining the result with a few adjustment parameters.

2. RADOS then maps PGs to OSDs according to the Cluster Map; this set of OSDs is where the data of the objects in the PG is stored. RADOS uses the CRUSH algorithm, a stable, pseudo-random hash, to achieve a balanced, capacity-aware data distribution. The set of OSDs produced by CRUSH is not yet the final storage target; it still goes through an initial filter, because in a large cluster some nodes may have failed. The filter removes such nodes, and if the remaining targets cannot satisfy the request, the current operation blocks.
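The following Python sketch only mimics the shape of this two-stage mapping; it is not Ceph's actual CRUSH algorithm, and the cluster_map dictionary and its field names are hypothetical stand-ins for the real Cluster Map:

import hashlib

def place_object(obj_name, pg_num, cluster_map):
    """Illustrative two-stage placement: object -> PG -> set of live OSDs."""
    # Stage 1: hash the object name into a placement group
    pg_id = int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % pg_num

    # Stage 2: look up the OSD set for this PG and filter out failed OSDs
    osds = cluster_map.get(pg_id, [])
    return [osd["id"] for osd in osds if osd.get("up", False)]

# Hypothetical usage
cluster_map = {
    0: [{"id": 1, "up": True}, {"id": 4, "up": False}],
    1: [{"id": 2, "up": True}, {"id": 3, "up": True}],
}
print(place_object("photo-123", pg_num=2, cluster_map=cluster_map))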

3. Device State

The Cluster Map describes device state as shown in the table below.

Table 1: Device state

            in (assigned PGs)               out (not assigned PGs)
up          online & reachable (active)     online & idle
down        unreachable                     unreachable & not remapped (failed)

4. Map Propagation

Cluster Map updates propagate between OSDs in a preemptive fashion. A difference in Cluster Map epoch only matters between two communicating entities: before exchanging information, the two entities exchange epochs so that their Cluster Maps stay in sync. This property spreads the work of distributing the Cluster Map globally across the updates exchanged between communicating entities.

Every OSD caches the most recent Cluster Map and all incremental maps up to the present. Every OSD message embeds an incremental map, and the OSD also tracks the Cluster Map epoch of the peers it communicates with. When a message from a peer shows that the peer's epoch is stale, the OSD shares the incremental maps the peer is missing so that both sides stay in sync; conversely, when a message shows that the local epoch is stale, the OSD derives the missing increments from the incremental maps embedded in the message, applies them locally, and stays in sync.

Epoch differences between two OSDs that are not communication peers do not affect synchronization.

3.2 Intelligent Storage

1. Replication

RADOS implements three different replication schemes, shown in Figure 3:

Figure 3: the three replication schemes implemented by RADOS

Primary-copy: reads and writes are both handled by the primary OSD, and the replicas are updated in parallel;

Chain: chained reads and writes, with reads and writes separated;

Splay: a compromise between primary-copy and chain: parallel replica updates combined with read/write separation.

2. Consistency

Consistency has two main aspects, updates and reads:

  1. Update: every message in RADOS embeds the sender's map epoch to keep the cluster consistent.
  2. Read: if an OSD fails, reads must be redirected from that OSD to a new one, but the initiator of the read operation may not yet have the failure information for that OSD; to avoid this, the OSDs hosting the same PG exchange heartbeats in real time.

3. Failure Detection

Failure detection: RADOS uses asynchronous, ordered point-to-point heartbeats. (Here the failure detection is performed by the OSDs themselves.)

4. Data Migration & Failure Recovery

Cluster Map updates caused by device failures, cluster expansion, or failure recovery change the mapping from PGs to OSDs, and once the Cluster Map changes, the data on the affected OSDs has to be adjusted accordingly.

Data migration and data recovery are both coordinated and carried out by the primary OSD.

(The specific methods for data migration & failure recovery are to be covered in a follow-up.)

3.3 Monitors

The Monitors hold the master copy of the Cluster Map; every other copy of the Cluster Map is initially obtained by request from the Monitors. The Monitors manage the storage cluster by periodically updating and publishing the Cluster Map.

A Monitor's work proceeds in two phases:

1. First, a Leader is elected among the Monitors. The Leader then requests the map epoch from all Monitors; the Monitors periodically report their results back to the Leader and declare themselves active (active Monitors), and the Leader tallies the quorum. The point of this phase is to ensure that every Monitor's map epoch is up to date, refreshing any stale Cluster Maps through incremental updates.

2. The Leader periodically grants each active Monitor a lease that authorizes it to serve copies of the Cluster Map to OSDs and clients. If a lease expires and the Leader has not renewed it, the Leader is considered dead and the protocol returns to phase one to re-elect a Leader; if an active Monitor fails to send its periodic ACK to the Leader, that Monitor is considered dead, and the protocol likewise returns to phase one to re-elect the Leader and update the quorum. The Leader's periodic leases and the active Monitors' periodic ACKs also serve to keep the Monitors' Cluster Maps in sync. When an active Monitor receives an update request, it first checks whether its current epoch is the latest; if not, it updates and reports up to the Leader, the Leader distributes the update to all Monitors and revokes the leases, and a new round begins, from Leader election through to serving the Cluster Map.

Normally a Monitor's load is small: Cluster Map updates among the OSDs are handled by the OSD-to-OSD mechanism, and OSD state changes are rare enough not to burden the Monitors. But some special situations can load the Monitors; for example, if n OSDs fail at the same time and each OSD stores m PGs, then m x n failure reports arrive at the Monitors, which is a lot of data for a large cluster. To avoid the load this puts on the Monitors, the OSDs stagger their failure reports (here, detection reported from the OSDs up to the Monitors) at pseudo-random intervals; in addition, given the Monitors' parallelism and load balancing, scaling out the Monitors is another way to relieve Monitor load.

4 Summary

Compared with traditional distributed data stores, RADOS's most distinctive features are:

1. After files are mapped to objects, the location of file data on the storage devices is computed from the Cluster Map via CRUSH rather than looked up in a table. This does away with the traditional file-to-block mapping and BlockMap management.

2. RADOS makes full use of the OSDs' intelligence, delegating part of the work to the OSDs to maximize scalability.

5 References

[1]     RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.

[2]     Ceph: A Scalable, High-Performance Distributed File System.

 

Lucene Query Parser

My motivation for translating this article was to understand how Lucene is used more systematically, and to test my translation skills :)

Original: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Overview

Although the API provided by Lucene allows you to create your own Query objects, Lucene also offers a rich query language through the Query Parser.

This page describes the syntax of Lucene's Query Parser: a set of rules, implemented with JavaCC, for interpreting a string as a Lucene query.

Before choosing to use the provided Query Parser, please consider the following points:

1. If you are programmatically generating a query string and then parsing it with the Query Parser, you should seriously consider building your query directly with the Query API. In other words, the Query Parser is designed for human-entered text, not for program-generated text.

2. Untokenized fields are best added directly to the Query, not fed through the Query Parser. If a field's values are generated by a program, the query clause for that field should be generated as well. The Analyzer the Query Parser uses is meant to tokenize human-entered text, whereas program-generated values, such as dates and keywords, are usually produced directly by the program.

3. In a query form, fields that contain general text should use the Query Parser. Everything else, such as date ranges and keywords, is better added directly to the Query through the Query API. A field with a limited set of values, such as one defined by a drop-down menu, should not be added to the query string (which is then parsed) but rather added as a TermQuery clause.

Terms

A query is made up of terms and operators. There are two kinds of terms: single terms and phrases.

A single term is a single word such as "test" or "hello".

A phrase is a group of words surrounded by double quotes, such as "hello dolly".

Multiple terms can be combined with Boolean operators to form a more complex query (described in detail below).

Note: the analyzer used to build the index will also be used to interpret the terms and phrases in the query string. It is therefore important to choose a sensible analyzer, one that will not interfere with the terms you use in your query string.

Fields

Lucene supports fielded data. When performing a search, you can either specify a field or use the default field. The field names and the default field are implementation specific.

You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.

As an example, assume a Lucene index contains two fields, title and text, with text being the default field. If you want documents whose title is "The Right Way" and whose text contains "don't go this way", you can enter:

  title:"The Right Way" AND text:go

or

  title:"Do it right" AND right

Since text is the default field, the field indicator is not required.

Note: a field applies only to the single term immediately following the colon, so the query

  title:Do it right

will only find "Do" in the title field. It will look for "it" and "right" in the default field (here, the text field).


Term Modifiers

Lucene supports modifying query terms to provide a wider range of search options.

Wildcard Searches

Lucene supports single- and multiple-character wildcard searches.

To perform a single-character wildcard search, use "?". For example, to search for "text" or "test" you can use:

  te?t

To perform a multiple-character wildcard search, use "*". For example, to search for "test", "tests", or "tester", you can use:

  test*

You can also put a wildcard in the middle of a term, for example:

  te*t

Note: you cannot use * or ? as the first character of a search.

Fuzzy Searches

Lucene supports fuzzy searches based on the Levenshtein distance, or edit distance, algorithm (a measure of string similarity). To do a fuzzy search, use the tilde "~" at the end of a single term. For example, to search for a term spelled similarly to "roam":

  roam~

This search will find terms like foam and roams.

Proximity Searches

Lucene supports finding words that occur within a specified distance of each other. Like a fuzzy search, a proximity search uses the tilde "~", placed at the end of a phrase, for example:

  "jakarta apache"~10

Range Searches

Range searches match documents whose field values lie between the lower and upper bounds specified by the range query. Range queries can include or exclude the bounds. For non-date fields, the ordering is lexicographic.

  mod_date:[20020101 TO 20030101]

This finds documents whose mod_date field has a value between 20020101 and 20030101, inclusive. Note that range queries are not reserved for date fields; you can also use them on non-date fields:

  title:{Aida TO Carmen}

This finds documents whose title lies between Aida and Carmen (in lexicographic order), excluding Aida and Carmen themselves.

Inclusive range queries are denoted by square brackets "[" and "]", while exclusive range queries are denoted by curly brackets "{" and "}".


Boosting a Term

Lucene ranks matching documents by relevance based on the terms found. To boost a term, use the caret "^" with a boost factor (a number) at the end of the term you are searching for. The higher the boost factor, the more relevant the term will be.

Boosting lets you control the relevance of a document by boosting its terms. For example, if you are searching for

  jakarta apache

and you want the term "jakarta" to be more relevant, boost it by appending "^" and a boost factor. You would type:

  jakarta^4 apache

This makes documents containing the term jakarta more relevant (Lucene sorts results by relevance by default, so the direct effect is that such documents rank higher). You can also boost phrase terms, for example:

  "jakarta apache"^4 "jakarta lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).


Boolean Operators

Boolean operators allow terms to be combined through logical operators. Lucene supports AND, "+", OR, NOT, and "-" as Boolean operators (note: Boolean operators must be ALL CAPS).

OR

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, OR is used. The OR operator links two terms and matches a document if either of the terms exists in it. This is equivalent to a union in set terms. The symbol "||" can be used in place of OR.

To search for documents that contain either "jakarta apache" or just "jakarta", use the query:

  "jakarta apache" jakarta

or

  "jakarta apache" OR jakarta

AND

The AND operator matches documents that contain all of the terms. This is equivalent to an intersection in set terms. The symbol "&&" can be used in place of AND.

To search for documents that contain both "jakarta apache" and "jakarta lucene", use the query:

  "jakarta apache" AND "jakarta lucene"

+

The "+" (required) operator requires that the term after the "+" symbol exist somewhere in a field of the matching document.

To search for documents that must contain "jakarta" and may contain "apache", use the query:

  +jakarta apache

NOT

The NOT operator excludes documents that contain the term after NOT. This is equivalent to a set difference (complement) in set terms. The symbol "!" can be used in place of NOT.

To search for documents that contain "jakarta apache" but not "jakarta lucene", use the query:

  "jakarta apache" NOT "jakarta lucene"

Note: the NOT operator cannot be used with just one term. For example, the following query will return no results:

  NOT "jakarta apache"

-

The "-" (prohibit) operator excludes documents that contain the term after the "-" symbol.

To search for documents that contain "jakarta apache" but not "jakarta lucene", use the query:

  "jakarta apache" -"jakarta lucene"
 
Grouping

Lucene supports using parentheses to group clauses into sub-queries. This can be very useful if you want to control the Boolean logic of a query.

To search for either "jakarta" or "apache", together with "website", use the query:

  (jakarta OR apache) AND website

This removes any ambiguity and makes sure that the matching documents must contain the term website and one of the terms jakarta or apache.

Field Grouping

Lucene supports using parentheses to group multiple clauses for a single field.

To search for a title that contains both the word "return" and the phrase "pink panther", use the query:

  title:(+return +"pink panther")

Escaping Special Characters

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape a special character, put a "\" before the character. For example, to search for (1+1):2, use the query:

  \(1\+1\)\:2

Original: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html


What Are the Advantages of Indexes?

What are the advantages of indexes?
1. By creating a unique index, you can guarantee the uniqueness of every row of data in a database table.

2. Indexes can greatly speed up data retrieval, which is the main reason for creating them.

3. Indexes can speed up joins between tables, which is particularly valuable for enforcing referential integrity.

4. When grouping and sorting clauses are used for data retrieval, indexes can likewise significantly reduce the time spent grouping and sorting in a query.

What are the disadvantages of indexes?
1. Creating and maintaining indexes costs time, and that cost grows with the amount of data.

2. Besides the space occupied by the data table itself, each index occupies additional physical space; a clustered index requires even more space.

3. When data in the table is inserted, deleted, or updated, the indexes must be maintained dynamically, which slows down data maintenance.

What types of indexes are there?
1. Regular index

This is the most basic index type, with no restrictions such as uniqueness.

2. Unique index

This is basically the same as a "regular index", with one difference: every value in the indexed column must appear only once, i.e. it must be unique.

3. Primary key

A primary key is a special kind of unique index that does not allow NULL values.

4. Full-text index

MySQL supports full-text indexing and full-text search starting from version 3.23.23.

Single-column and composite indexes:
A single-column index is built on a single column.

A composite (compound) index is built on two or more columns. When two or more columns are used together as a key in searches, it is best to create a composite index on those columns.

Things to note when creating and using indexes:
1. Indexes should be built on columns that are frequently used in SELECT operations. If a column is rarely used, an index on it does not noticeably improve query speed; on the contrary, the extra index slows down maintenance and increases the space requirement.

2. Indexes should be built on columns whose values are fairly unique; that is how an index delivers the most benefit, for example a primary key id column or a unique name column. If an index is built on a column with very few distinct values, such as a gender column or a column with only a handful of categories, the index is almost meaningless.

3. Columns defined with the text, image, or bit data types should not be indexed, because their data is either very large or has very few distinct values.

4. When update performance matters far more than retrieval performance, you should not create an index. Update performance and retrieval performance are in conflict: adding indexes improves retrieval but hurts updates, and removing indexes improves updates but hurts retrieval. So when update performance far outweighs retrieval performance, do not create the index.

5. Columns that appear in WHERE clauses and JOINs should be indexed.

6. MySQL indexes are not used for LIKE queries that start with the wildcards % or _. An index is used, however, for a query such as: select * from tbl1 where name like 'xxx%', so writing your SQL carefully matters.

Simple Lucene Examples (2)

When writing an article, I find the hardest part is often the title; sometimes I just don't know what to call it. In any case, what follows are some simple examples of using Lucene, so I picked a name more or less at random.

Lucene is actually quite simple: it mainly does two things, building an index and searching it.
Let's first look at some of the terms used in Lucene. I'm not going to explain them in detail, just point them out, because there is a wonderful thing in this world called search.

IndexWriter: one of the most important classes in Lucene. It is mainly used to add documents to the index and to control various parameters used during indexing.

Analyzer: the analyzer, mainly used to analyze the various kinds of text a search engine encounters. Commonly used analyzers include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.

Directory: where the index is stored. Lucene offers two kinds of index locations, disk and memory. Indexes are usually kept on disk; accordingly, Lucene provides the FSDirectory and RAMDirectory classes.

Document: a document. A Document is the unit that gets indexed; anything you want indexed must be converted into a Document object before it can be indexed.

Field: a field.

IndexSearcher: the most basic search tool in Lucene; every search goes through an IndexSearcher.

Query: a query. Lucene supports fuzzy queries, semantic queries, phrase queries, combined queries, and so on, through classes such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery.

QueryParser: a tool that parses user input; it can generate a Query object by scanning a user-entered string.

Hits: once a search completes, the results need to be returned and shown to the user; only then is the purpose of the search fulfilled. In Lucene, the collection of search results is represented by an instance of the Hits class.

That was a big pile of definitions; now let's look at a couple of simple examples.
1. A simple StandardAnalyzer test

Java code:


package lighter.javaeye.com;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StandardAnalyzerTest
{
    // Constructor
    public StandardAnalyzerTest()
    {
    }

    public static void main(String[] args)
    {
        // Create a StandardAnalyzer object
        Analyzer aAnalyzer = new StandardAnalyzer();
        // The test string
        StringReader sr = new StringReader("lighter javaeye com is the are on");
        // Create a TokenStream object
        TokenStream ts = aAnalyzer.tokenStream("name", sr);
        try {
            int i = 0;
            Token t = ts.next();
            while (t != null)
            {
                // Line number shown in the output
                i++;
                // Print the processed token
                System.out.println("Line " + i + ": " + t.termText());
                // Get the next token
                t = ts.next();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

Line 1: lighter
Line 2: javaeye
Line 3: com

A few notes:
StandardAnalyzer is the "standard analyzer" built into Lucene. It does the following:
1. Splits the original sentence on whitespace
2. Converts all uppercase letters to lowercase
3. Removes useless words such as "is", "the", and "are", and also removes all punctuation
Compare the result with new StringReader("lighter javaeye com is the are on") and this becomes clear.
I won't explain the API here; see Lucene's official documentation for the details. Note that this code uses the Lucene 2 API, which differs noticeably from version 1.4.3.

2. Another example: build a simple index and search it

Java code:


package lighter.javaeye.com;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;

public class FSDirectoryTest {

    // Path where the index will be created
    public static final String path = "c:\\index2";

    public static void main(String[] args) throws Exception {
        Document doc1 = new Document();
        doc1.add(new Field("name", "lighter javaeye com", Field.Store.YES, Field.Index.TOKENIZED));

        Document doc2 = new Document();
        doc2.add(new Field("name", "lighter blog", Field.Store.YES, Field.Index.TOKENIZED));

        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(path, true), new StandardAnalyzer(), true);
        writer.setMaxFieldLength(3);
        writer.addDocument(doc1);
        writer.setMaxFieldLength(3);
        writer.addDocument(doc2);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(path);
        Hits hits = null;
        Query query = null;
        QueryParser qp = new QueryParser("name", new StandardAnalyzer());

        query = qp.parse("lighter");
        hits = searcher.search(query);
        System.out.println("Found " + hits.length() + " result(s) for \"lighter\"");

        query = qp.parse("javaeye");
        hits = searcher.search(query);
        System.out.println("Found " + hits.length() + " result(s) for \"javaeye\"");
    }
}

Output:

Found 2 result(s) for "lighter"
Found 1 result(s) for "javaeye"