اضافه شدن ویژگی ساسی (SASI) در کاساندرا به منظور Full Search

1,352 تعداد نمایش

نمایه گذاری ساسی (SASI): از نسخه 3.4، میتوان از پیاده سازی جدیدی از اندیسهای ثانویه SSTable Attached Secondary Index (SASI) بهره برد. برای ستونهایی که توسط پیاده سازی ساسی (SASI)، اندیس ثانویه میشوند، میتوان در پرسوجوها از عملگرهای نامساوی (پرسوجوی محدودهای از مقادیر) و LIKE (مانند SQL) استفاده کرد. همچنین در این نوع پیاده سازی، در پرسوجوهایی که نیاز به پالایش دارند (Allow filtering)، کارایی قابل توجهی حاصل میشود. در این پیاده سازی تا به امروز (نسخه 3.9) نمیتوان Collection ها را اندیس نمود. سه روش متفاوت برای این نوع ایندکس در کاساندرا ارائه شده است که عناوین آنها به این ترتیب است.

حالت PREFIX index
حالت CONTAINS index
حالت SPARSE index

من دیگر حرفی ندارم… با اضافه شدن این قابلیت به کاساندرا اولین چیزی که به ذهن می آید این است که بزودی کاساندرا هم میخواهد جای فول تکست سرچرها را بگیرد و هم میخواهد جایگزین بانک های اطلاعاتی رابطه ای بشود و هم میخواهد به عنوان پایگاه داده های سندگرا ایفای نقش کند. فعلا وقت ندارم ترجمه کنم ولی خودتان یک نگاه به منبع آخر مطلب بکنید میتوانید متوجه بشوید چه خبر است.

Using a SSTable Attached Secondary Index (SASI)
Secondary indexes, SSTable Attached Secondary Indexes (SASI), have improved the performance of secondary indexes but should be used with caution.
Note: SASI indexes in DSE are experimental. DataStax does not support SASI indexes for production.
Using CQL, SSTable attached secondary indexes (SASI) can be created on a non-collection column defined in a table. Secondary indexes are used to query a table that uses a column that is not normally queryable, such as a non primary key column. SASI implements three types of indexes, PREFIX, CONTAINS, and SPARSE.

حالت PREFIX index:

Create an index fn_prefix for the table cyclist_name on the column firstname. PREFIX is the default mode, so it does not need to be specified.
CREATE TABLE cycling.cyclist_name (
id UUID PRIMARY KEY,
lastname text,
firstname text
);

CREATE CUSTOM INDEX fn_prefix ON cyclist_name (firstname) USING 'org.apache.cassandra.index.sasi.SASIIndex';
SELECT * FROM cycling.cyclist_name;

Queries can find exact matches for values in firstname. Note that indexing is used for this query, as the primary key id is not specified.

SELECT * FROM cyclist_name WHERE firstname = 'Marianne';

Queries can find matches for values in firstname based on partial matches. The use of LIKE specifies that the match is looking for a word that starts with the letter “M”. The % after the letter “M” will match any characters can return a matching value. Note that indexing is used for this query, as the primary key id is not specified.

SELECT * FROM cyclist_name WHERE firstname LIKE 'M%';

تمامی پرسوجو های زیر با خطا مواجه خواهند شد.

SELECT * FROM cyclist_name WHERE firstname = 'MARIANNE';
SELECT * FROM cyclist_name WHERE firstname LIKE 'm%';
SELECT * FROM cyclist_name WHERE firstname LIKE '%m%';
SELECT * FROM cyclist_name WHERE firstname LIKE '%m%' ALLOW FILTERING;
SELECT * FROM cyclist_name WHERE firstname LIKE '%M%';
SELECT * FROM cyclist_name WHERE firstname = 'M%';
SELECT * FROM cyclist_name WHERE firstname = '%M';
SELECT * FROM cyclist_name WHERE firstname = '%M%';
SELECT * FROM cyclist_name WHERE firstname = 'm%';

The first four queries fail because of case sensitivity. “MARIANNE” is all uppercase, whereas the stored value is not. The next three use a lowercase “m”. The placement of the % are critical; since the index specifies the PREFIX mode, only a trailing % will yield results when coupled with LIKE. The queries with equalities fail unless the exact match is designated.

حالت CONTAINS index:

Create an index fn_suffix for the table cyclist_name on the column firstname. CONTAINS is the specified mode, so that pattern matching for partial patterns given, not just in the prefix.

CREATE CUSTOM INDEX fn_contains ON cyclist_name (firstname) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'CONTAINS' };

Queries can find exact matches for values in firstname. Note that indexing is used for this query, as the primary key id is not specified. For queries on CONTAINS indexing, the ALLOW FILTERING phrase must be included, although the database will not actually filter.

SELECT * FROM cyclist_name WHERE firstname = 'Marianne' ALLOW FILTERING;

This query returns the same results as a query using PREFIX indexing that does an exact match using a slightly modified query.
Queries can find matches for values in firstname based on partial matches. The use of LIKE specifies that the match is looking for a word that contains the letter “M”. The % before and after the letter “M” will match any characters can return a matching value. Note that indexing is used for this query, as the primary key id is not specified.

?SELECT * FROM cyclist_name WHERE firstname LIKE '%M%';

Again, the same results are returned as for the PREFIX indexing, using a slightly modified query.
The CONTAINS indexing has a more versatile matching algorithm than PREFIX. Look at the examples below to see what results from variations of the last search.

SELECT * FROM cyclist_name WHERE firstname LIKE '%arianne';
SELECT * FROM cyclist_name WHERE firstname LIKE '%arian%';

Each query matches the pattern, either the final characters of the column value as in %arianne or the characters bracketed by % such as %arian%.
With CONTAINS indexing, even inequality pattern matching is possible. Note again the use of the ALLOW FILTERING phrase that required but causes no latency in the query response.

SELECT * FROM cyclist_name WHERE firstname > 'Mar' ALLOW FILTERING;

The only row matching the conditions returns the same value as the last query.

Like with PREFIX indexing, many queries will fail to find matches based on the partial string.

تمامی پرسوجو های زیر با خطا مواجه خواهند شد.

SELECT * FROM cyclist_name WHERE firstname = 'Marianne';
SELECT * FROM cyclist_name WHERE firstname = 'MariAnne' ALLOW FILTERING;
SELECT * FROM cyclist_name WHERE firstname LIKE '%m%';
SELECT * FROM cyclist_name WHERE firstname LIKE 'M%';
SELECT * FROM cyclist_name WHERE firstname LIKE '%M';
SELECT * FROM cyclist_name WHERE firstname LIKE 'm%';

The first query fails due to the absence of the ALLOW FILTERING phrase. The next two queries fail because of case sensitivity. “MariAnne” has one uppercase letter, whereas the stored value does not. The last three fail due to placement of the %.

Either the PREFIX index or the CONTAINS index can be created with case sensitivity by adding an analyzer class and case_sensitive option.

CREATE CUSTOM INDEX fn_suffix_allcase ON cyclist_name (firstname) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'CONTAINS',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};

The analyzer_class used here is the non-tokenizing analyzer that does not perform analysis on the text in the specified column. The option case_sensitive is set to false to make the indexing case insensitive.
With the addition of the analyzer class and option, the following query now also works, using a lowercase “m”.

SELECT * FROM cyclist_name WHERE firstname LIKE '%m%';

If queries are narrowed with an indexed column value, non-indexed columns can be specified. Compound queries can also be created with multiple indexed columns. This example alters the table to add a column age that is not indexed before performing the query.

ALTER TABLE cyclist_name ADD age int;
UPDATE cyclist_name SET age=23 WHERE id=5b6962dd-3f90-4c93-8f61-eabfa4a803e2;
INSERT INTO cyclist_name (id,age,firstname,lastname) VALUES (8566eb59-07df-43b1-a21b-666a3c08c08a,18,'Marianne','DAAE');
SELECT * FROM cyclist_name WHERE firstname='Marianne' and age > 20 allow filtering;

حالت SPARSE index:

The SPARSE index is meant to improve performance of querying large, dense number ranges like timestamps for data inserted every millisecond. If the data is numeric, millions of columns values with a small number of partition keys characterize the data, and range queries will be performed against the index, then SPARSE is the best choice. For numeric data that does not meet this criteria, PREFIX is the best choice.

CREATE CUSTOM INDEX fn_contains ON cyclist_name (age) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };

Use SPARSE indexing for data that is sparse (every term/column value has less than 5 matching keys). Indexing the created_at field in time series data (where there is typically few matching rows/events per created_at timestamp) is a good use case. SPARSE indexing is primarily an optimization for range queries, especially large ranges that span large timespans.

To illustrate the use of the SPARSE index, create a table and insert some time series data:

CREATE TABLE cycling.comments (commenter text, created_at timestamp, comment text, PRIMARY KEY (commenter));
INSERT INTO cycling.comments (commenter, comment, created_at) VALUES ('John', 'Fantastic race!', '2013-01-01 00:05:01.500');
INSERT INTO cycling.comments (commenter, comment, created_at) VALUES ('Jane', 'What a finish', '2013-01-01 00:05:01.400');
INSERT INTO cycling.comments (commenter, comment, created_at) VALUES ('Mary', 'Hated to see the race end.', '2013-01-01 00:05:01.300');
INSERT INTO cycling.comments (commenter, comment, created_at) VALUES ('Jonnie', 'Thankfully, it is over.', '2013-01-01 00:05:01.600');

Find all the comments made before the timestamp 2013-01-01 00:05:01.500.

SELECT * FROM cycling.comments WHERE created_at < '2013-01-01 00:05:01.500';

This query returns all the results where created_at is found to be less than the timestamp supplied. The inequalities >=, > and <= are all valid operators.
SPARSE indexing is used only for numeric data, so LIKE queries do not apply.

Using analyzers

Analyzers can be specified that will analyze the text in the specified column. The NonTokenizingAnalyzer is used for cases where the text is not analyzed, but case normalization or sensitivity is required. The StandardAnalyzer is used for analysis that involves stemming, case normalization, case sensitivity, skipping common words like “and” and “the”, and localization of the language used to complete the analysis. Altering the table again to add a lengthier text column provides a window into the analysis.

ALTER TABLE cyclist_name ADD comments text;

UPDATE cyclist_name SET comments ='Rides hard, gets along with others, a real winner' WHERE id = fb372533-eb95-4bb4-8685-6ef61e994caa;

UPDATE cyclist_name SET comments ='Rides fast, does not get along with others, a real dude' WHERE id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2;


CREATE CUSTOM INDEX stdanalyzer_idx ON cyclist_name (comments) USING 'org.apache.cassandra.index.sasi.SASIIndex'

WITH OPTIONS = {

'mode': 'CONTAINS',

'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',

'analyzed': 'true',

'tokenization_skip_stop_words': 'and, the, or',

'tokenization_enable_stemming': 'true',

'tokenization_normalize_lowercase': 'true',

'tokenization_locale': 'en'

};

This query will search for the presence of a designated string, using the analyzed text to return a result.

SELECT * FROM cyclist_name WHERE comments LIKE 'ride';

This query returns all the results where ride is found either as an exact word or as a stem for another word – rides in this case.

منبع:

http://pp1230.github.io/2015/12/26/cassandra-lucene-index.html

https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useSASIIndex.html

Stratio’s Lucene-based index for Cassandra is now a plugin

آدرس کانال تلگرام سایت بیگ دیتا:

t.me/bigdata_channel

آدرس کانال سروش ما:
https://sapp.ir/bigdata_channel

جهت دیدن سرفصل های دوره های آموزشی بر روی اینجا کلیک کنید.

Visits: 1634

خانه بیگ دیتا تحلیل شبکه های اجتماعی، متن کاوی، داده کاوی، اوسینت و داده های حجیم

اضافه شدن ویژگی ساسی (SASI) در کاساندرا به منظور Full Search

حالت PREFIX index:

حالت CONTAINS index:

حالت SPARSE index:

نوشته های مرتبط

همچنین ببینید

ویژگی نما های از پیش تولید شده یا Materialized view در کاساندرا

فیلم آموزش آپاچی اسپارک به زبان ساده از شرکت لیندا

دیدگاهتان را بنویسید