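Create a table
Syntax: create 'table_name', 'column_family', ...
create 'ORDER_INFO', 'C1', 'C2'
- Creates a table named ORDER_INFO with the column families C1 and C2.
- A table can have multiple column families.
List all tables
Command:
list
- Lists all tables that currently exist.
Enable a table
Syntax: enable 'table_name'
enable 'ORDER_INFO'
- Enables a table that was previously disabled.
Disable a table
Syntax: disable 'table_name'
disable 'ORDER_INFO'
- Disables a table. A table must be disabled before it can be dropped.
Drop a table
Syntax: drop 'table_name'
drop 'ORDER_INFO'
- Drops a table that has already been disabled.
Describe a table
Syntax: describe 'table_name'
describe 'ORDER_INFO'
Example: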
hbase:031:0> describe 'ORDER_INFO'
Table ORDER_INFO is ENABLED
ORDER_INFO, {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'C1', INDEX_BLOCK_ENCODING => 'NONE', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW',
IN_MEMORY => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536 B (64KB)'}
{NAME => 'C2', INDEX_BLOCK_ENCODING => 'NONE', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW',
IN_MEMORY => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536 B (64KB)'}
2 row(s)
Quota is disabled
Took 0.1003 seconds
- Shows the detailed schema of the specified table.
Alter a table
Syntax: alter 'table_name', { NAME => 'column_family', VERSIONS => number_of_versions }
alter 'ORDER_INFO', { NAME => 'C1', VERSIONS => 5 }
Example:
hbase:032:0> alter 'ORDER_INFO',{NAME => 'C1' , VERSIONS => '5'}
Updating all regions with the new schema...
1/1 regions updated.
Done.
Took 1.8954 seconds
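- Modifies the table schema, for example the number of versions a column family keeps.
Insert data
Syntax: put 'table_name', 'row_key', 'column_family:qualifier', 'value'
put 'ORDER_INFO', 'row1', 'C1:order_id', '12345'
Example: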
hbase:002:0> put 'ORDER_INFO', 'row1', 'C1:order_id', '12345'
Took 0.1551 seconds
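- Writes one cell into the table: in row 'row1', the column C1:order_id gets the value '12345'.
Get data
Syntax: get 'table_name', 'row_key'
get 'ORDER_INFO', 'row1'
- Retrieves the data stored under the specified row key.
- To display Chinese (or other non-ASCII) values readably, add {FORMATTER => 'toString'}.
get 'ORDER_INFO', 'row1', {FORMATTER => 'toString'}
Example: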
hbase:039:0> get 'ORDER_INFO', 'row1', {FORMATTER => 'toString'}
COLUMN CELL
C1:order_date timestamp=2024-10-15T12:52:37.435, value=2024-10-01
C1:order_id timestamp=2024-10-15T12:52:37.409, value=12345
C2:customer_name timestamp=2024-10-15T12:52:37.460, value=Alice
C2:customer_phone timestamp=2024-10-15T12:52:37.486, value=123-456-7890
1 row(s)
Took 0.0159 seconds
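Scan a table
Avoid scanning large tables: a full scan can run for a long time, exhaust memory, and even bring a node down.
Full table scan
Syntax: scan 'table_name'
scan 'ORDER_INFO'
Example: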
hbase:040:0> scan 'ORDER_INFO'
ROW COLUMN+CELL
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T12:52:37.409, value=12345
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
row2 column=C1:order_date, timestamp=2024-10-15T12:52:37.531, value=2024-10-02
row2 column=C1:order_id, timestamp=2024-10-15T12:52:37.518, value=67890
row2 column=C2:customer_name, timestamp=2024-10-15T12:52:37.546, value=Bob
row2 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.573, value=234-567-8901
row3 column=C1:order_date, timestamp=2024-10-15T12:52:37.615, value=2024-10-03
row3 column=C1:order_id, timestamp=2024-10-15T12:52:37.594, value=13579
row3 column=C2:customer_name, timestamp=2024-10-15T12:52:37.632, value=Charlie
row3 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.661, value=345-678-9012
row4 column=C1:order_date, timestamp=2024-10-15T12:52:37.726, value=2024-10-04
row4 column=C1:order_id, timestamp=2024-10-15T12:52:37.701, value=24680
row4 column=C2:customer_name, timestamp=2024-10-15T12:52:37.744, value=David
row4 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.768, value=456-789-0123
row5 column=C1:order_date, timestamp=2024-10-15T12:52:37.810, value=2024-10-05
row5 column=C1:order_id, timestamp=2024-10-15T12:52:37.798, value=11223
row5 column=C2:customer_name, timestamp=2024-10-15T12:53:27.341, value=Eva
row5 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.842, value=567-890-1234
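- Scans every row in the table. Use with caution; it is inefficient.
Limit the number of rows returned
Syntax: scan 'table_name', {LIMIT => N}
N in LIMIT => N is the number of row keys returned, not the number of output lines (one row key can print several cells).
scan 'ORDER_INFO', {LIMIT => 3}
Example: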
hbase:041:0> scan 'ORDER_INFO' , {LIMIT => 3}
ROW COLUMN+CELL
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T12:52:37.409, value=12345
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
row2 column=C1:order_date, timestamp=2024-10-15T12:52:37.531, value=2024-10-02
row2 column=C1:order_id, timestamp=2024-10-15T12:52:37.518, value=67890
row2 column=C2:customer_name, timestamp=2024-10-15T12:52:37.546, value=Bob
row2 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.573, value=234-567-8901
row3 column=C1:order_date, timestamp=2024-10-15T12:52:37.615, value=2024-10-03
row3 column=C1:order_id, timestamp=2024-10-15T12:52:37.594, value=13579
row3 column=C2:customer_name, timestamp=2024-10-15T12:52:37.632, value=Charlie
row3 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.661, value=345-678-9012
3 row(s)
- Limits the number of rows returned.
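The difference between row keys and printed lines can be sketched in plain Ruby. This is only a model of the scan output, not HBase code: the CELLS data and the scan_with_limit helper are hypothetical names introduced for illustration.

```ruby
# Plain-Ruby model of scan output; no HBase connection is involved.
# Each cell is [row_key, column, value], already ordered by row key.
CELLS = [
  ['row1', 'C1:order_id', '12345'],
  ['row1', 'C2:customer_name', 'Alice'],
  ['row2', 'C1:order_id', '67890'],
  ['row2', 'C2:customer_name', 'Bob'],
  ['row3', 'C1:order_id', '13579'],
  ['row3', 'C2:customer_name', 'Charlie'],
  ['row4', 'C1:order_id', '24680']
].freeze

# Hypothetical helper: keep every cell that belongs to the first `limit`
# distinct row keys, which is how LIMIT => N is described above.
def scan_with_limit(cells, limit)
  kept_rows = cells.map(&:first).uniq.first(limit)
  cells.select { |row_key, _, _| kept_rows.include?(row_key) }
end

result = scan_with_limit(CELLS, 2)
puts result.length                    # printed cell lines: 4
puts result.map(&:first).uniq.length  # row keys: 2
```

With a limit of 2 the model prints four cell lines but reports two rows, matching the "3 row(s)" footer under twelve output lines in the transcript above.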
Select specific columns
Syntax: scan 'table_name', {COLUMNS => ['column_family:qualifier', ...]}
scan 'ORDER_INFO', {COLUMNS => ['C1:order_id', 'C2:customer_name']}
Example:
hbase:043:0> scan 'ORDER_INFO', {COLUMNS => ['C1:order_id', 'C2:customer_name']}
ROW COLUMN+CELL
row1 column=C1:order_id, timestamp=2024-10-15T12:52:37.409, value=12345
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row2 column=C1:order_id, timestamp=2024-10-15T12:52:37.518, value=67890
row2 column=C2:customer_name, timestamp=2024-10-15T12:52:37.546, value=Bob
row3 column=C1:order_id, timestamp=2024-10-15T12:52:37.594, value=13579
row3 column=C2:customer_name, timestamp=2024-10-15T12:52:37.632, value=Charlie
row4 column=C1:order_id, timestamp=2024-10-15T12:52:37.701, value=24680
row4 column=C2:customer_name, timestamp=2024-10-15T12:52:37.744, value=David
row5 column=C1:order_id, timestamp=2024-10-15T12:52:37.798, value=11223
row5 column=C2:customer_name, timestamp=2024-10-15T12:53:27.341, value=Eva
5 row(s)
Took 0.0310 seconds
- Scans only the specified columns.
Scan by RowKey prefix
Syntax: scan 'table_name', {ROWPREFIXFILTER => 'prefix'}
scan 'ORDER_INFO', {ROWPREFIXFILTER => 'row1'}
Example:
hbase:044:0> scan 'ORDER_INFO', {ROWPREFIXFILTER => 'row1'}
ROW COLUMN+CELL
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T12:52:37.409, value=12345
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
1 row(s)
Took 0.0073 seconds
hbase:045:0> scan 'ORDER_INFO', {ROWPREFIXFILTER => 'row'}
ROW COLUMN+CELL
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T12:52:37.409, value=12345
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
row2 column=C1:order_date, timestamp=2024-10-15T12:52:37.531, value=2024-10-02
row2 column=C1:order_id, timestamp=2024-10-15T12:52:37.518, value=67890
row2 column=C2:customer_name, timestamp=2024-10-15T12:52:37.546, value=Bob
row2 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.573, value=234-567-8901
row3 column=C1:order_date, timestamp=2024-10-15T12:52:37.615, value=2024-10-03
row3 column=C1:order_id, timestamp=2024-10-15T12:52:37.594, value=13579
row3 column=C2:customer_name, timestamp=2024-10-15T12:52:37.632, value=Charlie
row3 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.661, value=345-678-9012
row4 column=C1:order_date, timestamp=2024-10-15T12:52:37.726, value=2024-10-04
row4 column=C1:order_id, timestamp=2024-10-15T12:52:37.701, value=24680
row4 column=C2:customer_name, timestamp=2024-10-15T12:52:37.744, value=David
row4 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.768, value=456-789-0123
row5 column=C1:order_date, timestamp=2024-10-15T12:52:37.810, value=2024-10-05
row5 column=C1:order_id, timestamp=2024-10-15T12:52:37.798, value=11223
row5 column=C2:customer_name, timestamp=2024-10-15T12:53:27.341, value=Eva
row5 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.842, value=567-890-1234
5 row(s)
Took 0.0292 seconds
- Scans the rows whose RowKey starts with the given prefix.
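A prefix scan can be expressed as a range scan: start at the prefix and stop just before the smallest key that no longer begins with it. The sketch below models that idea in plain Ruby. The helper names stop_row_for_prefix and prefix_scan are hypothetical, and the extra key 'row10' is made up to show that the prefix 'row1' would match it as well.

```ruby
# Hypothetical helper (not an HBase API): smallest key that sorts after every
# key starting with `prefix`. Returns nil when no such key exists (empty
# prefix, or only 0xFF bytes), meaning "scan to the end of the table".
def stop_row_for_prefix(prefix)
  bytes = prefix.bytes
  bytes.pop while bytes.last == 0xFF   # trailing 0xFF bytes cannot be incremented
  return nil if bytes.empty?
  bytes[-1] += 1
  bytes.pack('C*')
end

# Select the keys in [prefix, stop_row) using byte-wise comparison.
def prefix_scan(row_keys, prefix)
  stop = stop_row_for_prefix(prefix)
  row_keys.sort.select { |k| k.b >= prefix.b && (stop.nil? || k.b < stop) }
end

keys = %w[row1 row10 row2 row3]          # 'row10' is an invented key
p stop_row_for_prefix('row1')            # "row2"
p prefix_scan(keys, 'row1')              # ["row1", "row10"]
p prefix_scan(keys, 'row').length        # 4
```

If the row keys are not fixed-width, a prefix such as 'row1' matches 'row10', 'row11' and so on; zero-padding the keys avoids surprises.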
Add a filter
Syntax: scan 'table_name', {FILTER => "filter expression"}
scan 'ORDER_INFO', {FILTER => "SingleColumnValueFilter('C1', 'order_id', =, 'binary:12345')"}
Example:
hbase:046:0> scan 'ORDER_INFO', {FILTER => "SingleColumnValueFilter('C1', 'order_id', =, 'binary:12345')"}
ROW COLUMN+CELL
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T12:52:37.409, value=12345
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
1 row(s)
Took 0.0614 seconds
hbase:047:0> scan 'ORDER_INFO', {FILTER => "SingleColumnValueFilter('C1', 'order_id', =, 'binary:13579')"}
ROW COLUMN+CELL
row3 column=C1:order_date, timestamp=2024-10-15T12:52:37.615, value=2024-10-03
row3 column=C1:order_id, timestamp=2024-10-15T12:52:37.594, value=13579
row3 column=C2:customer_name, timestamp=2024-10-15T12:52:37.632, value=Charlie
row3 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.661, value=345-678-9012
1 row(s)
Took 0.0067 seconds
Delete data
Syntax: delete 'table_name', 'row_key', 'column_family:qualifier'
delete 'ORDER_INFO', 'row1', 'C1:order_id'
Example:
hbase:048:0> put 'ORDER_INFO','row1','C1:order_id','11111'
Took 0.0079 seconds
hbase:049:0> scan 'ORDER_INFO', {ROWPREFIXFILTER => 'row1'}
ROW COLUMN+CELL
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T13:20:09.379, value=11111
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
1 row(s)
Took 0.0123 seconds
hbase:050:0> put 'ORDER_INFO','row1','C1:order_id','22222'
Took 0.0123 seconds
hbase:051:0> scan 'ORDER_INFO', {ROWPREFIXFILTER => 'row1'}
ROW COLUMN+CELL
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T13:20:25.601, value=22222
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
1 row(s)
Took 0.0295 seconds
hbase:052:0> delete 'ORDER_INFO','row1','C1:order_id'
Took 0.0210 seconds
hbase:053:0> scan 'ORDER_INFO', {ROWPREFIXFILTER => 'row1'}
ROW COLUMN+CELL
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T13:20:09.379, value=11111
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
1 row(s)
Took 0.0132 seconds
hbase:054:0>
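- Deletes the data of one column in the specified row.
Notes:
- When the column holds several versions, delete removes only the most recent one, so the previous version becomes visible again. In the transcript above, deleting the value 22222 brings back 11111.
- Current HBase releases keep 1 version per column by default, as the VERSIONS => '1' in the describe output shows (releases before 0.96 kept 3). The older value survives here because C1 was altered to VERSIONS => 5 earlier.
- The VERSIONS attribute of the column family controls how many versions are kept.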
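The interaction between VERSIONS and delete can be sketched with a small plain-Ruby model. It is a simplification under stated assumptions: the VersionedCell class is hypothetical, excess versions are dropped immediately, and delete simply removes the newest entry, whereas real HBase writes delete markers and purges old versions lazily during flushes and compactions.

```ruby
# Plain-Ruby model of one versioned cell; it imitates the behaviour shown in
# the transcript and is not HBase code.
class VersionedCell
  def initialize(max_versions)
    @max_versions = max_versions
    @versions = []                     # newest first; entries are [timestamp, value]
  end

  # Store a new version and keep at most max_versions of them.
  def put(timestamp, value)
    @versions.unshift([timestamp, value])
    @versions.sort_by! { |ts, _| -ts }
    @versions = @versions.first(@max_versions)
  end

  # Like shell delete without a timestamp: removes only the newest version.
  def delete_latest
    @versions.shift
  end

  # Value of the newest remaining version, or nil when nothing is left.
  def get
    @versions.empty? ? nil : @versions.first[1]
  end
end

cell = VersionedCell.new(5)            # like VERSIONS => 5 on C1
cell.put(1, '11111')
cell.put(2, '22222')
cell.delete_latest
p cell.get                             # "11111"

single = VersionedCell.new(1)          # like the default VERSIONS => 1
single.put(1, '11111')
single.put(2, '22222')
single.delete_latest
p single.get                           # nil in this model
```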
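Delete an entire row
Syntax: deleteall 'table_name', 'row_key'
deleteall 'ORDER_INFO', 'row1'
- Deletes all data in the specified row.
Update data
- There is no separate update command: put a new value to the same cell to overwrite the existing one.
put 'ORDER_INFO', 'row1', 'C1:order_id', '67890'
Count rows
Syntax: count 'table_name'
count 'ORDER_INFO'
Example: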
hbase:054:0> count 'ORDER_INFO'
5 row(s)
Took 0.0174 seconds
=> 5
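- Counts the rows in the table.
INCR (counter increment)
In HBase, the INCR operation creates a column if needed and adds to its value, which suits counters and similar use cases.
Syntax
incr 'table_name', 'row_key', 'column_family:qualifier', increment
- table_name: the table to operate on.
- row_key: the row to operate on.
- column_family:qualifier: the column to operate on.
- increment: the amount to add; a positive number increases the value and a negative number decreases it. If omitted, the increment is 1.
Create and accumulate
- Create: if the column does not exist, INCR creates it, treating the missing value as 0 and then adding the increment, so the new column starts at the increment value.
incr 'ORDER_INFO', 'row1', 'C1:order_count', 20
  - Sets the C1:order_count column of row 'row1' to an initial value of 20.
- Accumulate: if the column already exists, INCR adds the increment, which may be positive or negative, to the current value.
incr 'ORDER_INFO', 'row1', 'C1:order_count', 1
  - Increments the C1:order_count column of row 'row1' by 1.
- The operation is atomic, which makes it suitable for counting under high concurrency.
- A column that needs to be incremented should be created with INCR. A value written by PUT from the shell is stored as a string rather than as an 8-byte long, so INCR cannot be applied to it.
Read a counter value
- Use get_counter to read a counter. A plain get prints the raw 8-byte value rather than a readable number.
get_counter 'ORDER_INFO', 'row1', 'C1:order_count'
Example: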
hbase:060:0> incr 'ORDER_INFO', 'row1', 'C1:order_count', 20
COUNTER VALUE = 20
Took 0.0053 seconds
hbase:061:0> scan 'ORDER_INFO', {ROWPREFIXFILTER => 'row1'}
ROW COLUMN+CELL
row1 column=C1:order_count, timestamp=2024-10-15T13:25:40.085, value=\x00\x00\x00\x00\x00\x00\x00(
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T13:20:09.379, value=11111
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
1 row(s)
Took 0.0056 seconds
hbase:062:0> scan 'ORDER_INFO', {ROWPREFIXFILTER => 'row1' , FORMATTER => 'toString'}
ROW COLUMN+CELL
row1 column=C1:order_count, timestamp=2024-10-15T13:25:40.085, value=(
row1 column=C1:order_date, timestamp=2024-10-15T12:52:37.435, value=2024-10-01
row1 column=C1:order_id, timestamp=2024-10-15T13:20:09.379, value=11111
row1 column=C2:customer_name, timestamp=2024-10-15T12:52:37.460, value=Alice
row1 column=C2:customer_phone, timestamp=2024-10-15T12:52:37.486, value=123-456-7890
1 row(s)
Took 0.0089 seconds
hbase:063:0> get_counter 'ORDER_INFO' , 'row1','C1:order_count'
COUNTER VALUE = 20
Took 0.0043 seconds
hbase:064:0>
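The unreadable value that scan prints for a counter column is the 8-byte big-endian encoding of a signed 64-bit integer. The sketch below reproduces that encoding in plain Ruby; encode_counter and decode_counter are hypothetical helper names, not shell commands.

```ruby
# Encode a signed 64-bit integer the way a counter cell stores it:
# 8 bytes, big-endian.
def encode_counter(n)
  [n].pack('q>')
end

# Decode 8 raw bytes back into an integer; anything else is not a counter.
def decode_counter(bytes)
  raise ArgumentError, 'a counter value must be exactly 8 bytes' unless bytes.bytesize == 8
  bytes.unpack1('q>')
end

raw = encode_counter(20)
p raw.bytes             # [0, 0, 0, 0, 0, 0, 0, 20]
p decode_counter(raw)   # 20
p '20'.bytesize         # 2: a string written by put is not 8 bytes wide
```

The last line shows why a column created with put '...', '20' cannot be used as a counter: its value is two ASCII characters, not an 8-byte long.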
List all snapshots
Syntax: list_snapshots
list_snapshots
Example:
hbase:065:0> list_snapshots
SNAPSHOT TABLE + CREATION TIME + TTL(Sec)
ORDER_INFO_SNAPSHOT ORDER_INFO (2024-10-15 14:13:25 +0800) FOREVER
1 row(s)
Took 0.0487 seconds
=> ["ORDER_INFO_SNAPSHOT"]
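hbase:066:0>
- Lists all HBase table snapshots.
Create a snapshot
Syntax: snapshot 'table_name', 'snapshot_name'
snapshot 'ORDER_INFO', 'ORDER_INFO_SNAPSHOT'
# create_snapshot 'ORDER_INFO', 'ORDER_INFO_SNAPSHOT'
# Reportedly this form works too (verify it on your HBase version).
Example: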
hbase:064:0> snapshot 'ORDER_INFO', 'ORDER_INFO_SNAPSHOT'
Took 2.4963 seconds
hbase:065:0> list_snapshots
SNAPSHOT TABLE + CREATION TIME + TTL(Sec)
ORDER_INFO_SNAPSHOT ORDER_INFO (2024-10-15 14:13:25 +0800) FOREVER
1 row(s)
Took 0.0487 seconds
=> ["ORDER_INFO_SNAPSHOT"]
hbase:066:0>
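- Creates a snapshot of the table as a backup of its current state.
Use a snapshot (clone it into a new table)
Syntax: clone_snapshot 'snapshot_name', 'new_table_name'
clone_snapshot 'ORDER_INFO_SNAPSHOT', 'CLONE_ORDER_INFO'
Example: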
hbase:001:0> clone_snapshot 'ORDER_INFO_SNAPSHOT', 'CLONE_ORDER_INFO'
Took 2.5459 seconds
hbase:002:0> list
TABLE
CLONE_ORDER_INFO
ORDER_INFO
2 row(s)
Took 0.0142 seconds
=> ["CLONE_ORDER_INFO", "ORDER_INFO"]
hbase:003:0>
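Restore data from a snapshot
Syntax: restore_snapshot 'snapshot_name'
The restore is applied directly to the table the snapshot was taken from.
restore_snapshot 'ORDER_INFO_SNAPSHOT'
- Note: the table must be disabled before restoring (run enable 'ORDER_INFO' afterwards to bring it back online):
disable 'ORDER_INFO'
restore_snapshot 'ORDER_INFO_SNAPSHOT'
Example: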
hbase:007:0> restore_snapshot 'ORDER_INFO_SNAPSHOT'
ERROR: Table ORDER_INFO should be disabled!
For usage try 'help "restore_snapshot"'
Took 0.0209 seconds
hbase:008:0> disable 'ORDER_INFO'
Took 0.3326 seconds
hbase:009:0> restore_snapshot 'ORDER_INFO_SNAPSHOT'
Took 0.2898 seconds
hbase:010:0>
Delete a snapshot
Syntax: delete_snapshot 'snapshot_name'
delete_snapshot 'ORDER_INFO_SNAPSHOT'
- Deletes the specified snapshot.
Export a snapshot
- To copy a snapshot from one cluster to another, use the ExportSnapshot tool:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot ORDER_INFO_SNAPSHOT -copy-to hdfs://mycluster/hbaseSnapshot -mappers 4
- -snapshot: name of the snapshot to export.
- -copy-to: HDFS path on the target cluster.
- -mappers: number of mappers to run in parallel.
Example:
(base) root@hadoop-master1:/opt/hbase/bin# ./hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot ORDER_INFO_SNAPSHOT -copy-to hdfs://mycluster/hbaseSnapshot -mappers 4
2024-10-15T14:43:57,762 INFO [main] snapshot.ExportSnapshot: Verify the source snapshot's expiration status and integrity.
2024-10-15T14:44:00,202 INFO [main] snapshot.ExportSnapshot: Copy Snapshot Manifest from hdfs://mycluster/hbase/.hbase-snapshot/ORDER_INFO_SNAPSHOT to hdfs://mycluster/hbaseSnapshot/.hbase-snapshot/.tmp/ORDER_INFO_SNAPSHOT
2024-10-15T14:44:01,187 INFO [main] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2024-10-15T14:44:01,290 INFO [main] mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1728971830625_0001
2024-10-15T14:44:02,566 INFO [main] snapshot.ExportSnapshot: Loading Snapshot 'ORDER_INFO_SNAPSHOT' hfile list
2024-10-15T14:44:02,657 INFO [main] mapreduce.JobSubmitter: number of splits:3
2024-10-15T14:44:02,808 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1728971830625_0001
2024-10-15T14:44:02,811 INFO [main] mapreduce.JobSubmitter: Executing with tokens: []
2024-10-15T14:44:05,190 INFO [main] conf.Configuration: resource-types.xml not found
2024-10-15T14:44:05,190 INFO [main] resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-10-15T14:44:05,671 INFO [main] impl.YarnClientImpl: Submitted application application_1728971830625_0001
2024-10-15T14:44:05,727 INFO [main] mapreduce.Job: The url to track the job: http://hadoop-master2:8088/proxy/application_1728971830625_0001/
2024-10-15T14:44:05,727 INFO [main] mapreduce.Job: Running job: job_1728971830625_0001
2024-10-15T14:44:15,854 INFO [main] mapreduce.Job: Job job_1728971830625_0001 running in uber mode : false
2024-10-15T14:44:15,855 INFO [main] mapreduce.Job: map 0% reduce 0%
2024-10-15T14:44:25,930 INFO [main] mapreduce.Job: map 33% reduce 0%
2024-10-15T14:44:26,933 INFO [main] mapreduce.Job: map 67% reduce 0%
2024-10-15T14:44:27,936 INFO [main] mapreduce.Job: map 100% reduce 0%
2024-10-15T14:44:28,942 INFO [main] mapreduce.Job: Job job_1728971830625_0001 completed successfully
2024-10-15T14:44:28,992 INFO [main] mapreduce.Job: Total time spent by all maps in occupied slots (ms)=26453
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=26453
Total vcore-milliseconds taken by all map tasks=26453
Total megabyte-milliseconds taken by all map tasks=27087872
Map-Reduce Framework
Map input records=3
Map output records=0
Input split bytes=594
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=233
CPU time spent (ms)=2660
Physical memory (bytes) snapshot=1093185536
Virtual memory (bytes) snapshot=7706193920
Total committed heap usage (bytes)=1143996416
Peak Map Physical memory (bytes)=372260864
Peak Map Virtual memory (bytes)=2569465856
org.apache.hadoop.hbase.snapshot.ExportSnapshot$Counter
BYTES_COPIED=15707
BYTES_EXPECTED=15707
BYTES_SKIPPED=0
COPY_FAILED=0
FILES_COPIED=3
FILES_SKIPPED=0
MISSING_FILES=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2024-10-15T14:44:28,993 INFO [main] snapshot.ExportSnapshot: Finalize the Snapshot Export
2024-10-15T14:44:29,014 INFO [main] snapshot.ExportSnapshot: Verify the exported snapshot's expiration status and integrity.
2024-10-15T14:44:29,772 INFO [main] snapshot.ExportSnapshot: Export Completed: ORDER_INFO_SNAPSHOT
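Merge regions
Syntax: merge_region 'region1', 'region2'
- Merges the two specified regions.
- First run list_regions 'table_name' to find the regions (merge_region expects their encoded region names).
Split a region
Syntax: split 'table_name', 'split_key'
split 'ORDER_INFO', 'row3'
- Splits the table at the given row key, which helps balance data across regions.
major_compact
Syntax: major_compact 'table_name'
- Runs a major compaction on the table, merging all store files.
minor_compact
Syntax: compact 'table_name'
- Runs a minor compaction on the table, merging some of the store files to reduce the number of HFiles.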
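Permission management
Grant: grant 'user', 'permissions', 'table_name', 'column_family', 'qualifier'
grant 'admin', 'RWXCA', 'ORDER_INFO'
- Grants the user permissions: R (read), W (write), X (execute), C (create), A (admin).
Revoke: revoke 'user', 'table_name', 'column_family', 'qualifier'
revoke 'admin', 'ORDER_INFO'
- Revokes the user's permissions.
Backup and restore
- Use the ExportSnapshot tool and the restore_snapshot command to back up and restore table data.
Run a command file
- Use the HBase Shell to run a file of commands.
hbase shell /path/to/command-file.txt
- Make sure the file contains only valid HBase Shell commands.
Import and export data
Syntax: use the bundled MapReduce tools.
- ImportTsv loads TSV-formatted data files into an HBase table.
- Export writes a table's data to files in HDFS (SequenceFiles, not plain text).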
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.separator=',' \
-Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf1:age,cf1:city,cf1:phone,cf1:email,cf2:occupation,cf2:company,cf2:salary,cf2:experience,cf2:department,cf3:hobby,cf3:favorite_color,cf3:sport,cf3:pet,cf3:music,cf4:address,cf4:zipcode,cf4:state,cf4:country,cf4:continent,cf5:social_media,cf5:website,cf5:blog,cf5:subscribed,cf5:membership \
USER_INFO /hbasedata/hbase_large_million_dataset.csv
Example
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,C1:order_id,C2:customer_name 'ORDER_INFO' /path/to/data.tsv
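How the -Dimporttsv.columns list lines up with the fields of each input line can be sketched in plain Ruby. This is a model, not the tool itself: map_line is a hypothetical helper, the sample line (row6, 99999, Frank) is invented, and the model raises on a field-count mismatch where the real tool has its own bad-line handling.

```ruby
# Hypothetical helper: map one input line to [row_key, {column => value}].
# The i-th name in the column spec describes the i-th field of the line;
# the special name HBASE_ROW_KEY marks the field used as the row key.
def map_line(columns_spec, line, separator = "\t")
  columns = columns_spec.split(',')
  fields = line.chomp.split(separator, -1)
  unless fields.length == columns.length
    raise ArgumentError, 'field count does not match the column spec'
  end

  row_key = nil
  cells = {}
  columns.zip(fields).each do |column, value|
    if column == 'HBASE_ROW_KEY'
      row_key = value
    else
      cells[column] = value
    end
  end
  [row_key, cells]
end

spec = 'HBASE_ROW_KEY,C1:order_id,C2:customer_name'
row_key, cells = map_line(spec, "row6\t99999\tFrank")   # invented sample line
p row_key   # "row6"
p cells     # {"C1:order_id"=>"99999", "C2:customer_name"=>"Frank"}
```

Tab is the default separator; for comma-separated input, pass -Dimporttsv.separator=',' as in the longer command above (or ',' as the third argument of this model).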
Counting rows in large tables
For large datasets, use a MapReduce job to count the rows of a table more efficiently, for example with the RowCounter tool.
hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'ORDER_INFO'
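Filters
In HBase, filters restrict the result set returned by a scan or get, which improves query efficiency and reduces unnecessary data transfer.
Syntax:
- HBase Shell commands are Ruby (JRuby) scripts that call the Java API provided by HBase behind the scenes.
- HBase has many filters and their syntax can look complicated, so focus on understanding what each part of the expression means.
- In the HBase Shell a filter is described by an expression string; in Java the same filter is built by constructing objects with new.
Explanation:
scan "ORDER_INFO", {FILTER => "RowFilter(=,'binary:02602f66-adc7-40d4-8485-76b5632b5b53')", COLUMNS => ['C1:STATUS', 'C1:PAYWAY'], FORMATTER => 'toString'}
- "RowFilter(=,'binary:02602f66-adc7-40d4-8485-76b5632b5b53')" is the filter expression.
- RowFilter is the name of the Filter class (its constructor) in the Java API.
- RowFilter(...) can be read as creating a filter object.
- = is a comparison operator in the filter expression; >, <, >=, <= and != are also available.
- binary:02602f66-adc7-40d4-8485-76b5632b5b53 is a comparator expression. For simplicity, think of the comparator as the value to compare against: binary:xxx means compare directly against the value xxx.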
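To make the operator and comparator parts of a filter expression concrete, here is a plain-Ruby model of how a pair such as =, 'binary:row1' is evaluated against a value. It is a sketch under assumptions, not the HBase parser: compare, apply and equality are hypothetical names, only the comparator types binary, binaryprefix, substring and regexstring are modelled, and Ruby regular expressions stand in for Java ones.

```ruby
# Turn the result of <=> (-1, 0, 1) into true/false for the given operator.
def apply(op, cmp)
  case op
  when '='  then cmp.zero?
  when '!=' then !cmp.zero?
  when '<'  then cmp.negative?
  when '<=' then cmp <= 0
  when '>'  then cmp.positive?
  when '>=' then cmp >= 0
  else raise ArgumentError, "unknown operator: #{op}"
  end
end

# substring and regexstring only make sense with = and !=.
def equality(op, matched)
  case op
  when '='  then matched
  when '!=' then !matched
  else raise ArgumentError, "#{op} is not supported for this comparator"
  end
end

# Hypothetical model of "<operator>, '<type>:<value>'" from a filter string.
def compare(op, comparator, actual)
  type, expected = comparator.split(':', 2)
  case type
  when 'binary'        # byte-wise comparison of the whole value
    apply(op, actual.b <=> expected.b)
  when 'binaryprefix'  # compare only the first expected.bytesize bytes
    apply(op, actual.b[0, expected.bytesize] <=> expected.b)
  when 'substring'     # case-insensitive containment
    equality(op, actual.downcase.include?(expected.downcase))
  when 'regexstring'   # pattern found anywhere in the value
    equality(op, !Regexp.new(expected).match(actual).nil?)
  else
    raise ArgumentError, "unknown comparator type: #{type}"
  end
end

p compare('=', 'binary:row1', 'row1')         # true
p compare('=', 'binary:row1', 'row10')        # false
p compare('=', 'binaryprefix:row1', 'row10')  # true
p compare('>=', 'binary:row3', 'row4')        # true
p compare('=', 'substring:lic', 'Alice')      # true
```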
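RowKey filters
- PrefixFilter: filters by row key prefix.
scan 'ORDER_INFO', {FILTER => "PrefixFilter('row1')"}
  - Returns only rows whose row key starts with 'row1'.
- RowFilter: filters by comparing the row key.
scan 'ORDER_INFO', {FILTER => "RowFilter(=, 'binary:row1')"}
  - Returns only the row whose row key equals 'row1'.
- InclusiveStopFilter: stops the scan at the given row key, which is included in the result.
scan 'ORDER_INFO', {FILTER => "InclusiveStopFilter('row3')"}
  - Scans until the row key 'row3' is reached, then stops.
- RandomRowFilter: returns a random subset of rows.
scan 'ORDER_INFO', {FILTER => org.apache.hadoop.hbase.filter.RandomRowFilter.new(0.5)}
  - Includes each row with a probability of 50%. RandomRowFilter is not part of the shell's filter-string language, so the filter object is constructed directly.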
Column filters
- SingleColumnValueFilter: filters rows by the value of a specified column.
scan 'ORDER_INFO', {FILTER => "SingleColumnValueFilter('C1', 'order_id', =, 'binary:12345')"}
  - Returns only rows whose order_id column equals 12345.
- ColumnPrefixFilter: filters by column name prefix.
scan 'ORDER_INFO', {FILTER => "ColumnPrefixFilter('order')"}
  - Returns only columns whose name starts with 'order'.
- QualifierFilter: filters by comparing the column qualifier.
scan 'ORDER_INFO', {FILTER => "QualifierFilter(=, 'binary:order_id')"}
  - Returns only columns whose qualifier equals 'order_id'.
- FamilyFilter: filters by comparing the column family.
scan 'ORDER_INFO', {FILTER => "FamilyFilter(=, 'binary:C1')"}
  - Returns only data in the column family 'C1'.
- DependentColumnFilter: keeps only cells that have a matching entry in the specified column.
scan 'ORDER_INFO', {FILTER => "DependentColumnFilter('C1', 'order_id')"}
  - Returns data only for rows that contain the C1:order_id column, keeping the cells whose timestamp matches that column's.
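Other filters
- PageFilter: limits the number of rows returned, for paged queries.
scan 'ORDER_INFO', {FILTER => "PageFilter(10)"}
  - Returns only the first 10 rows.
- ValueFilter: filters by cell value.
scan 'ORDER_INFO', {FILTER => "ValueFilter(=, 'binary:12345')"}
  - Returns only cells whose value is 12345.
- TimestampsFilter: filters by timestamp.
scan 'ORDER_INFO', {FILTER => "TimestampsFilter(1631022245123, 1631022245124)"}
  - Returns only cells with one of the listed timestamps.
- KeyOnlyFilter: returns only the key part of each cell, without the values.
scan 'ORDER_INFO', {FILTER => "KeyOnlyFilter()"}
  - Returns keys only, which is useful for checking whether rows exist.
- SkipFilter: skips whole rows that contain a cell rejected by the wrapped filter.
scan 'ORDER_INFO', {FILTER => "SKIP ValueFilter(!=, 'binary:12345')"}
  - Skips every row that contains a cell whose value is 12345. In a filter string, SkipFilter is written as the SKIP operator in front of the wrapped filter.
- FirstKeyOnlyFilter: returns only the first key-value pair of each row.
scan 'ORDER_INFO', {FILTER => "FirstKeyOnlyFilter()"}
  - Fetches only the first key-value pair of every row; typically used for row counting.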
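Combining filters
The Java API combines filters with FilterList; in a shell filter string the same thing is written with the AND and OR operators.
scan 'ORDER_INFO', {FILTER => "SingleColumnValueFilter('C1', 'order_id', =, 'binary:12345') AND PrefixFilter('row1')"}
- Combines several filter conditions and returns the rows that satisfy all of them.
- Use AND or OR to control how the conditions are combined.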