Performance Troubleshooting: DB File Sequential Read when performing a Full Table Scan

Hello All,

What wait event do you observe when Oracle performs a full table scan?

  • DB File Scattered Read
  • Direct Path Reads (from 11g onwards, under certain conditions)

What wait event do you observe when Oracle accesses a table via an index scan?

  • DB File Sequential Read

While we were working on an issue, we found that a full table scan of a 17 GB table was performing db file sequential reads.

This post is not about how to resolve that issue (that needs a separate post), but about the reasons why Oracle performs db file sequential reads when doing a full table scan. The following are the most likely cases.

Case 1: Chained Rows

Rows chained across multiple blocks force Oracle to follow the chain pieces with additional single-block reads.

 

Case 2: Table size just above MBRC

For example, if MBRC is 8 and the table has 10 blocks, Oracle does one db file scattered read for the first 8 blocks and db file sequential reads for the remaining 2.
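A quick way to see how close a table is to the MBRC boundary (a sketch; owner/table names are placeholders):

-- Compare the table's block count against the multiblock read size
SELECT blocks FROM dba_tables WHERE owner = 'SCOTT' AND table_name = 'EMP';
SELECT value FROM v$parameter WHERE name = 'db_file_multiblock_read_count';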

Case 3: More than 255 columns

Oracle splits a row with more than 255 columns into multiple row pieces, so the columns beyond 255 land in different blocks (row chaining). When a scan needs those columns it has to fetch the extra pieces from other blocks, so you get single-block reads inside the multiblock scan. To confirm this, check v$sesstat for the statistic "table fetch continued row".
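A quick check for this case (a sketch; :sid is the session you are watching):

-- Intra-row chaining shows up in the "table fetch continued row" session statistic
SELECT sn.name, ss.value
  FROM v$statname sn, v$sesstat ss
 WHERE sn.statistic# = ss.statistic#
   AND sn.name = 'table fetch continued row'
   AND ss.sid = :sid;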

Case 4: Index-Organized Table (IOT)

An IOT is physically an index, so scanning it can show up as db file sequential reads.

Case 5: Materialized View

A materialized view, particularly when accessed through its index, may show db file sequential reads.

Case 6: Extent and Block Mapping

This occurs when Oracle is unable to read a full DBFMBRC worth of blocks from a given extent. It usually happens towards the high-water mark of the table, but in theory it can happen in any extent of the table.

Case 7: Cached Blocks

When some of a table's blocks are already in the buffer cache, Oracle will not read them from disk again, so it performs partial multiblock reads around the cached blocks, for example:

block 9813, 9814 and 9815 - scattered read (MBR) with MBRC=3
block 9816 - cache read
block 9817 and 9818 - scattered Oracle read with MBRC=2
block 9819 - cache read
block 9820 - sequential Oracle read
block 9821 - cache read
block 9822 - sequential Oracle read
In this case you got 2 MBR + 2 SBR + 3 cache reads instead of 1 MBR with MBRC=10 (or 16).
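To see how much of the table is already sitting in the buffer cache (a sketch; owner/table names are placeholders):

-- Count buffers in the cache belonging to the table's data object
SELECT COUNT(*)
  FROM v$bh
 WHERE objd = (SELECT data_object_id FROM dba_objects
                WHERE owner = 'SCOTT' AND object_name = 'EMP')
   AND status <> 'free';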

It seems cases 3 and 6 are the ones we encountered, as the table has 299 columns and the extent allocation shows varying sizes that are not rounded to multiples of MBRC.

(The last two cases were found on the internet, credited to Gaza Vaidyanth and others.)

Hope this helps

-Thanks

Geek

Quick Question : Finding Transactions on a Table

Folks! This subject line may sound silly, but most of us really have to stop and think about it for a moment.

  • Business requirement: someone wants to add a column to a highly volatile table, so they want to understand:
  • Does it cause an issue, given the table is highly volatile and accessed by the website?
  • If it does, what is the best time to add the column?
  • What is an alternative method to minimise the impact, if any?

Ans 1: You can add a column, that's not an issue.

Ans 2: Adding a column to a highly volatile table does not have much impact by itself, but if there are uncommitted transactions running, your DDL may have to wait for them and complete afterwards. There is a caveat here: adding a nullable column with no default completes quickly, while adding a column with a default value can take much longer, because the default must be applied to all the existing rows; on a huge table that takes time, and meanwhile the DDL lock keeps new transactions from starting, so they pile up. (From 11g onwards, adding a NOT NULL column with a default is a metadata-only operation, and 12c extends this to nullable columns with defaults, so this mainly bites on older versions.)
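If the worry is the DDL getting stuck behind uncommitted transactions, one mitigation (available from 11g) is to let the DDL wait a bounded time for the lock instead of failing immediately; a minimal sketch, with a hypothetical table and column name:

-- Wait up to 30 seconds for the TM lock instead of erroring out with ORA-00054
ALTER SESSION SET ddl_lock_timeout = 30;
ALTER TABLE portal.hot_table ADD (new_col VARCHAR2(30));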

   

Ans 3: Best time to add a column? First find how many transactions are happening on that table and, if auditing is enabled, find when the fewest transactions occur.

SQL> select * from dba_tab_modifications where table_name in ('XXXXXXXXXXXXXXX');

SCHEMA    TABLE                   INSERTS  UPDATES  DELETES
--------  ----------------------  -------  -------  -------
PORTAL    XXXXXXXXXXXXXXXXXXXX      16415    16415     9964

Then find from the audit trail at what time most of the transactions fire on those tables (I am using XML auditing, hence v$xml_audit_trail; you can use DBA_AUDIT_TRAIL instead). Notice that at 04:00 there are very few transactions on the table. Establish a pattern for each day, then work out the least transactional time for that table.

SQL> select object_name,to_char(EXTENDED_TIMESTAMP,'DD-MON-YY HH24'),count(*) from v$xml_audit_trail  where object_name in ('XXXXXXXXXXXXXXX') group by object_name,to_char(EXTENDED_TIMESTAMP,'DD-MON-YY HH24');

TABLENAME                  DATE HR        COUNT
-------------------------  ------------  ------
XXXXXXXXXXXXXXXXXXXXXXXX   10-AUG-16 22     483
XXXXXXXXXXXXXXXXXXXXXXXX   10-AUG-16 23     626
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 00     736
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 01     962
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 02     183
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 03     332
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 04      66
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 05     333
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 07     162
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 08      62
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 09     894
XXXXXXXXXXXXXXXXXXXXXXXX   11-AUG-16 10     717

Ans 4: An alternative method for adding a column without impact? Scratching my head, really 🙂. Do you know any, apart from keeping a replicated copy of the table and renaming it (which still locks the table briefly)?
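For what it's worth, the "replicated table and rename" idea is roughly what online redefinition automates; a minimal sketch, assuming an interim table PORTAL.HOT_TABLE_INT has already been created with the extra column (all names here are placeholders, and FINISH_REDEF_TABLE still takes a brief exclusive lock at the end):

-- Online redefinition keeps PORTAL.HOT_TABLE available while the interim copy is built and synced
EXEC DBMS_REDEFINITION.CAN_REDEF_TABLE('PORTAL', 'HOT_TABLE');
EXEC DBMS_REDEFINITION.START_REDEF_TABLE('PORTAL', 'HOT_TABLE', 'HOT_TABLE_INT');
EXEC DBMS_REDEFINITION.SYNC_INTERIM_TABLE('PORTAL', 'HOT_TABLE', 'HOT_TABLE_INT');
EXEC DBMS_REDEFINITION.FINISH_REDEF_TABLE('PORTAL', 'HOT_TABLE', 'HOT_TABLE_INT');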

Geek DBA

Cassandra for Oracle DBA’s Part 7 – Adding & Deleting Nodes

Adding a node is straightforward.

1. Generate a token list using a token calculation script, so you get the token range for the new node.

2. Download the Cassandra software, unpack it, and set the following important parameters in cassandra.yaml:

cluster_name: 'geek_cluster'

seeds: "127.0.0.1, 127.0.0.2,127.0.0.3"

listen_address: 127.0.0.4

 

rpc_address: 127.0.0.4

token: 

 

3. Start Cassandra:

$CASSANDRA_HOME/bin/cassandra -f

Now, when the new node bootstraps into the cluster, something starts working behind the scenes: data rebalancing.

If you recollect ASM disk operations: when you add or delete a disk at the diskgroup level, the existing data is rebalanced to the remaining or new disks within the diskgroup. Cassandra does the same, but at the node level, driven by the token range each node owns.

So with three nodes, the ownership output shows each node owning 33.3% of the data:

root@wash-i-16ca26c8-prod ~/.ccm $ ccm node1 nodetool ring

Note: Ownership information does not include topology; for complete information, specify a keyspace

Datacenter: datacenter1

Address         Rack        Status State   Load            Owns                Token

                                                                                                      3074457345618258602

127.0.0.1       rack1       Up     Normal  24.84 KB        33.33%             -9223372036854775808

127.0.0.2       rack1       Up     Normal  24.8 KB         33.33%              -3074457345618258603

127.0.0.3       rack1       Up     Normal  24.87 KB        33.33%              3074457345618258602

I added the node with CCM rather than manually:

ccm add --itf 127.0.0.4 --jmx-port 7400 -b node4

Check the status again after a nodetool repair:

root@wash-i-16ca26c8-prod ~/.ccm $ ccm node1 nodetool ring

Note: Ownership information does not include topology; for complete information, specify a keyspace

Datacenter: datacenter1

==========

Address         Rack        Status State   Load            Owns                Token

                                                                                                      3074457345618258602

127.0.0.1       rack1       Up     Normal  43.59 KB        33.33%              -9223372036854775808

127.0.0.4       rack1       Up     Normal  22.89 KB        16.67%             -6148914691236517206

127.0.0.2       rack1       Up     Normal  48.36 KB        16.67%             -3074457345618258603

127.0.0.3       rack1       Up     Normal  57.37 KB        33.33%              3074457345618258602

As you can see, with three nodes each owned 33%, whereas with four nodes two of them have rebalanced their data down to 16.67% each, because of the new token range assigned to the new node.

This way, node additions and deletions do not cause data loss, since the rebalance operation runs online and behind the scenes, much like ASM.

While the rebalancing is running, you can check the following to see how much has completed and how much is pending, similar to v$asm_operation:

root@wash-i-16ca26c8-prod ~/.ccm/repository/4.5.2/demos/portfolio_manager/bin $ ccm node1 nodetool netstats

Mode: NORMAL

Not sending any streams.

Read Repair Statistics:

Attempted: 0

Mismatch (Blocking): 0

Mismatch (Background): 0

Pool Name                    Active   Pending      Completed

Commands                        n/a         0         140361

Responses                       n/a         0         266253

If a node is leaving the cluster, that is also visible with the nodetool netstats command:

root@wash-i-16ca26c8-prod ~/.ccm/repository/4.5.2/demos/portfolio_manager/bin $ ccm node4 nodetool netstats

Mode: LEAVING

Not sending any streams.

Read Repair Statistics:

Attempted: 0

Mismatch (Blocking): 0

Mismatch (Background): 0

Pool Name                    Active   Pending      Completed

Commands                        n/a         0            159

Responses                       n/a         0         238788

Further, to delete a node, nodetool decommission should be used rather than remove, since remove drops the node and deletes its data without rebalancing. Here I directly removed node4:

root@wash-i-16ca26c8-prod ~ $ ccm node4 remove

The status shows only three nodes are up:

root@wash-i-16ca26c8-prod ~ $ ccm status

Cluster: 'geek_cluster'

node1: UP

node3: UP

node2: UP

root@wash-i-16ca26c8-prod ~ $ ccm node1 nodetool status

Note: Ownership information does not include topology; for complete information, specify a keyspace

Datacenter: Cassandra

=====================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address    Load       Tokens  Owns   Host ID                               Rack

UN  127.0.0.1  2.09 MB    1       33.3%  1dc82d65-f88d-4b79-9c1b-dc5aa2a55534  rack1

UN  127.0.0.2  3.02 MB    1       23.6%  ab247945-5989-48f3-82b3-8f44a3aaa375  rack1

UN  127.0.0.3  3.22 MB    1       33.3%  023a4514-3a74-42eb-be49-feaa69bf098c  rack1

DN  127.0.0.4  3.39 MB    1       9.8%   9d5b4aee-6707-4639-a2d8-0af000c25b45  rack1

Note that node4's status shows DN, and it holds 9.8% of the data, which appears to have been lost because of the direct remove. I then stopped Cassandra, started it again, and here is the result:

root@wash-i-16ca26c8-prod ~/.ccm/repository/4.5.2/resources/cassandra/conf $ ccm start

[node1 ERROR] org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.io.compress.CorruptBlockException: (/var/lib/cassandra/data/system/local/system-local-jb-93-Data.db): corruption detected, chunk at 0 of length 261.

I tried a nodetool repair to fix the data, but the repair would not proceed because of node4:

root@wash-i-16ca26c8-prod ~/.ccm/repository/4.5.2/bin $ ccm node1 nodetool repair

Traceback (most recent call last):

  File "/usr/local/bin/ccm", line 86, in <module>

    cmd.run()

  File "/usr/local/lib/python2.7/site-packages/ccmlib/cmds/node_cmds.py", line 267, in run

    stdout, stderr = self.node.nodetool(" ".join(self.args[1:]))

  File "/usr/local/lib/python2.7/site-packages/ccmlib/dse_node.py", line 264, in nodetool

    raise NodetoolError(" ".join(args), exit_status, stdout, stderr)

ccmlib.node.NodetoolError: Nodetool command '/root/.ccm/repository/4.5.2/bin/nodetool -h localhost -p 7100 repair' failed; exit status: 1; stdout: [2016-07-13 01:19:08,567] Nothing to repair for keyspace 'system'

[2016-07-13 01:19:08,573] Starting repair command #1, repairing 2 ranges for keyspace PortfolioDemo

[2016-07-13 01:19:10,719] Repair session cc495b80-4897-11e6-9deb-e7c99fc0dbe2 for range (-3074457345618258603,3074457345618258602] finished

[2016-07-13 01:19:10,720] Repair session cd8beda0-4897-11e6-9deb-e7c99fc0dbe2 for range (3074457345618258602,-9223372036854775808] failed with error java.io.IOException: Cannot proceed on repair because a neighbor (/127.0.0.4) is dead: session failed

[2016-07-13 01:19:10,720] Repair command #1 finished

[2016-07-13 01:19:10,728] Starting repair command #2, repairing 4 ranges for keyspace dse_system

[2016-07-13 01:19:10,735] Repair session cd8e1080-4897-11e6-9deb-e7c99fc0dbe2 for range (-3074457345618258603,3074457345618258602] failed with error java.io.IOException: Cannot proceed on repair because a neighbor (/127.0.0.4) is dead: session failed

[2016-07-13 01:19:10,736] Repair session cd8e3790-4897-11e6-9deb-e7c99fc0dbe2 for range (3074457345618258602,-9223372036854775808] failed with error java.io.IOException: Cannot proceed on repair because a neighbor (/127.0.0.4) is dead: session failed

[2016-07-13 01:19:10,737] Repair session cd8eacc0-4897-11e6-9deb-e7c99fc0dbe2 for range (-9223372036854775808,-7422755166451980864] failed with error java.io.IOException: Cannot proceed on repair because a neighbor (/127.0.0.4) is dead: session failed

[2016-07-13 01:19:10,738] Repair session cd8ed3d0-4897-11e6-9deb-e7c99fc0dbe2 for range (-7422755166451980864,-3074457345618258603] failed with error java.io.IOException: Cannot proceed on repair because a neighbor (/127.0.0.4) is dead: session failed

[2016-07-13 01:19:10,738] Repair command #2 finished

So the best way to delete a node is with decommission; once the node shows as decommissioned, you can remove it.

If you recollect Oracle node deletion, we first deconfigure the CRS on the node and then delete the node.

-Thanks

GEEK DBA

Cassandra for Oracle DBA’s Part 5 – High Availability

High availability is key for any distributed system, and it includes load balancing. For high availability you can have active-active nodes or active-passive nodes; Cassandra and Oracle RAC both fall into the first category, active-active. But Cassandra has important architectural differences compared with Oracle RAC:

  • Cassandra does not use shared storage; the data is distributed across nodes, a classic example of distributed storage processing.
  • Cassandra does not follow master-slave technology, which removes the need to talk to a master node every time, both at the database level and at the cluster level.
  • Cassandra does not use a private network; node communication happens over the common network using the gossip protocol.
  • Cassandra does not heartbeat every node against every other node in the cluster, which in Oracle RAC can turn into clumsy network congestion.

Consider this example of the cluster network in Oracle RAC, where each node pings every other node in the cluster; that is why you must use a low-latency network for RAC.

But with Cassandra, every node is configured to know about a seed node rather than every node in the cluster. Consider this example: it reduces the footprint of cluster network usage while still allowing effective communication.

In the cluster, A and B are configured as seed nodes and C, D, E as non-seed nodes (in each node's cassandra.yaml file), so C and D are configured to reach A, and E is configured to reach B, for gossip information, i.e. node availability information.

When the cluster starts, A and B come up and see each other, while C, D, E reach out to the seed nodes; once they warm up, each node knows about every other node.

Cassandra uses the gossip protocol to reach nodes every 2 seconds and update the availability of nodes.

Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.

To prevent problems in gossip communications, use the same list of seed nodes for all nodes in a cluster. This is most critical the first time a node starts up. By default, a node remembers other nodes it has gossiped with between subsequent restarts. The seed node designation has no purpose other than bootstrapping the gossip process for new nodes joining the cluster. Seed nodes are not a single point of failure, nor do they have any other special purpose in cluster operations beyond the bootstrapping of nodes.

In my CCM (developer mode) node configuration, nodes 1, 2 and 3 are configured as seed nodes:

root@wash-i-16ca26c8-prod ~/.ccm/geek_cluster/node1/resources/cassandra/conf $ grep seed cassandra.yaml

seed_provider:

  - seeds: 127.0.0.1,127.0.0.2,127.0.0.3

And for node1, the heartbeat information from gossipinfo (updated every 2 seconds):

root@wash-i-16ca26c8-prod /scripts/db_reconfiguration $ ccm node1 nodetool gossipinfo | grep heartbeat

  heartbeat:122773

  heartbeat:122775

  heartbeat:122764

  heartbeat:122776

Every 2 seconds counts as a heartbeat, and the failure detector (the phi accrual failure detection method, with its magic constant 0.434) works against Cassandra's default setting of phi_convict_threshold = 8.

So at each heartbeat the detector recalculates phi from the recent arrival samples, and when it exceeds the convict threshold the node is marked down; as a rough example, 8 / 0.434 = 18.4, meaning the node would be marked down after about 18.4 seconds.

In this way, Cassandra ensures the nodes are reachable and available to the cluster and process.

Cassandra for Oracle DBA’s Part 5 : Read and Write Consistency

With the data-not-shared model, how does a client or query get consistent data, and how does a write ensure it is writing the right data?

The obvious answer is that maintaining consistency in a database means adhering to the ACID properties.

Cassandra follows atomicity, isolation and durability, but consistency is eventual, or rather tunable, consistency.

Having said that, data will eventually become consistent among nodes.

What is eventual consistency?

ANS: The data becomes consistent when required, or on demand, over a period of time.

But does the client need to wait until the data becomes consistent?

ANS: No. When a client sends a request for data, the underlying query carries a consistency level, set per session or per query, such as ALL, ANY, ONE and so on.

For example: select * from Portfolios.PortfolioDemo USING CONSISTENCY ALL|ONE|TWO|THREE|QUORUM|LOCAL_QUORUM (older CQL syntax).
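In CQL3/cqlsh the level is usually set for the session rather than embedded in the statement; a small sketch using the keyspace/table names from the example above:

-- cqlsh: set the consistency level for subsequent statements, then query
CONSISTENCY QUORUM;
SELECT * FROM Portfolios.PortfolioDemo;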

When a replication factor is set and a read happens, the coordinator that receives the client request sends requests to the nodes holding the partition range of the data, and those nodes send the data back to the coordinator. The coordinator then merges the responses, checks them against the digests it received from each node, and verifies the data is the latest. If any node's data does not match the digest, that node's data is repaired in the background, and so it eventually becomes consistent.

I do not want to waste any effort rewriting this material, since the Datastax documentation has very good pages on the read/write paths in Cassandra covering all the scenarios.

https://docs.datastax.com/en/cassandra_win/2.2/cassandra/dml/dmlClientRequestsReadExp.html

https://docs.datastax.com/en/cassandra_win/2.2/cassandra/dml/dmlClientRequestsWrite.html

So with consistency levels in place, digest matching and online read repair, Cassandra ensures that the data it returns becomes consistent, eventually and according to your requirements.

Next post: high availability compared with Oracle RAC (cluster interconnect, node evictions, etc.).

12c Upgrade Gotchas

The following are various gotchas during or after an upgrade to 12c, collected from various sources plus a few from my own environment.

1. The resource_limit parameter now defaults to TRUE

   Impact: Resource Manager is invoked automatically and pushes resource limits onto sessions, so users may receive CPU/IO restriction messages.

2. Global temporary table statistics are session-level

   Impact: earlier versions of Oracle use shared statistics for global temporary tables, whereas in 12c statistics on global temporary tables are session-private by default, which means other sessions cannot see them.
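If a particular application depends on the old behaviour, the preference can be flipped back per table; a small sketch, with hypothetical owner/table names:

-- Revert one GTT to shared (pre-12c style) statistics
EXEC DBMS_STATS.SET_TABLE_PREFS('APPUSER', 'GTT_ORDERS', 'GLOBAL_TEMP_TABLE_STATS', 'SHARED');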

3. HISTGRM$ add column

Impact: In 12.1 the optimizer group added a column to histgrm$, which is a clustered table. For an upgrade from 11.2 to 12.1, this column, ep_repeat_count, is added as a default-value column with a NOT NULL constraint. Since histgrm$ can have millions of rows, adding a default-value column can take a long time because all existing rows must be updated with the default, and this slows down the upgrade significantly; apply patch 19333670 on 12.1.0.2 to avoid it.

4. Dropping a column that has a NULL default value can cause corruption (bug 17325413).

5. Dynamic statistics: performance problems

   Impact: dynamic statistics has been enhanced in Oracle 12c, but this can lead to longer parse times.

   Automatic dynamic statistics is used more often in 12c, which can increase parse time in more cases than before.

6. A large WRI$_OPTSTAT_HISTHEAD_HISTORY table can make the upgrade take much longer; purge or reorganise it before the upgrade.

7. Wrong results in GROUP BY with an outer join and a bind variable; workaround:

   alter system set "_optimizer_aggr_groupby_elim"=false scope=both;

   alter system set "_optimizer_reduce_groupby_key"=false scope=both;

8. In-Memory option: do not enable it unless you have a licence.

9. DBA_REGISTRY may also report CATALOG and CATPROC in UPGRADED status.

    Solution: run utlrp.sql
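A quick way to check and fix this (a sketch):

-- List component status after the upgrade, then recompile invalid objects
SELECT comp_id, status FROM dba_registry;
@?/rdbms/admin/utlrp.sql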

10. SQL Plan Directives and dynamic sampling can increase CPU utilisation and can also fill up the result cache.

11. Oracle 12c has deprecated the sec_case_sensitive_logon parameter. Moving to case-sensitive passwords could break your application.

12. If you have clients/JDBC versions below 10g, make sure you set SQLNET.ALLOWED_LOGON_VERSION_SERVER to 10 (or 8, or 9) in sqlnet.ora on the database host.

     In earlier releases the default was 8, whereas in 12c it is 11, which means only 11g and later clients can connect; lowering it to 10 (or 8/9) allows older client versions to connect to the database.

13. Be aware that in 12c, by default, auditing runs in mixed mode, which means you can combine traditional auditing with unified auditing. There is one gotcha to be aware of:

  – if the database has been upgraded, there will be no unified policy enabled;

  – if the database has been newly created, two unified policies are enabled: ORA_SECURECONFIG and ORA_LOGON_FAILURES.

Thanks

GEEK DBA

Cassandra for Oracle DBA’s Part 4 – Data Distribution across Nodes

RAC uses a shared-disk architecture, where the data is shared across the nodes and, at any point in time, every instance returns the same data using cache fusion.

But in Cassandra the important difference is that the storage is not shared; instead it is sharded, i.e. partitioned and stored on each node.

And where the RAC architecture looks like a horizontal landscape (the hub nodes), as in the figure below, Cassandra nodes form a ring (a peer architecture). In a ring architecture every node can act as coordinator; there is no master in the cluster, unlike RAC.

Note: 12c Flex Cluster uses a similar hub-and-leaf layout, but the hub nodes, i.e. the database nodes, still use shared storage, so do not confuse the two.

Diagrams (images): RAC 12c Flex Cluster Architecture vs Cassandra Sharded Cluster Nodes

 

Well, if the data is not shared, then as DBAs the common questions that come to our mind are 🙂

1. How is the data distributed to the nodes?

2. How do the nodes see consistent data when a read happens?

3. If a write happens on one node, how do the other nodes know about it?

4. How about data availability? What if one of the nodes is corrupted or a disk fails?

Before answering these questions, let's recap the ASM diskgroup mirroring and striping methods.

ASM uses diskgroup-level redundancy, aka mirroring, with Normal (two copies) and High (three copies) redundancy, so at any time it maintains that many copies of the data stored in the diskgroup. Even a diskgroup or disk failure will not cause data loss.

ASM uses disk-level striping, so it evenly distributes the data across the disks in a diskgroup; when a disk is added or removed, the data is redistributed first (re-balancing) before further disk operations are allowed. This ensures data can be read and written evenly, avoiding a single point of contention on the disks.

Keeping those in mind, think of the same thing at the node level rather than the diskgroup level: in Cassandra, replication and striping are maintained per node, so each node stores its own set of data and possibly replicated data from other nodes.

In Cassandra, mirroring is achieved with the replication factor and striping is achieved with the token range assigned to each node by the partitioner (I am not covering vnodes here).

Partitioner/tokens: Cassandra uses a hashing mechanism to assign a token range to each node. When a node is added or deleted, the tokens must be reassigned (unless vnodes are configured) so that data is redistributed to the other or new nodes, much like ASM diskgroup rebalancing. With a token range in place for each node, the other required piece is the table's partition key, a column whose value determines where a row is stored. For every DML operation, the row's partition key is hashed, matched against the nodes' token ranges, and the row is written to the owning node. In this way each node holds a non-shared portion of each table's data.
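You can actually see this hashing from CQL, since the token() function shows the token a partition key maps to (and hence which node's range it falls in); a small sketch against a hypothetical table:

-- Each row's partition key and the token it hashes to
SELECT id, token(id) FROM test.users;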

Cassandra supports different types of hash mechanisms for assigning tokens to nodes, and a partitioner determines how data is distributed across the nodes in the cluster (including replicas). Basically, a partitioner is a function for deriving a token representing a row from its partition key, typically by hashing; each row of data is then distributed across the cluster by the value of the token. Both the Murmur3Partitioner and RandomPartitioner use tokens to help assign equal portions of data to each node and evenly distribute data from all the tables throughout the ring or other grouping, such as a keyspace. This is true even if the tables use different partition keys, such as usernames or timestamps. Moreover, the read and write requests to the cluster are also evenly distributed, and load balancing is simplified because each part of the hash range receives an equal number of rows on average.

You can set the partitioner type in cassandra.yaml file using

  • Murmur3Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
  • RandomPartitioner: org.apache.cassandra.dht.RandomPartitioner
  • ByteOrderedPartitioner: org.apache.cassandra.dht.ByteOrderedPartitioner

Mirroring/Replication :- Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed. The total number of replicas across the cluster is referred to as the replication factor. 

The replication factor is configured at the keyspace level (think of a tablespace/schema), and every object/table in that keyspace then maintains that many replicas across the nodes. A replication factor of 1 means that there is only one copy of each row, on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.

The replication strategy is chosen with the class option at the keyspace level, so that each data centre maintains its own set of data replicas.

Two replication strategies are available:

 

  • SimpleStrategy: Use only for a single data center and one rack. If you ever intend more than one data center, use the NetworkTopologyStrategy.
  • NetworkTopologyStrategy: Highly recommended for most deployments because it is much easier to expand to multiple data centers when required by future expansion.
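As a sketch, the per-data-centre replication factors used in the multi-DC example below would be declared on the keyspace like this (the data centre names must match those defined in the snitch configuration):

CREATE KEYSPACE messages
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'us-west' : 3, 'us-east' : 2 };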

With the above in mind, look at the example with replication factor 3, first with a single data centre and then with multiple data centres.

Multi-datacenter clusters for geographically separate data

Now let's answer the questions above.

1. How is the data distributed to the nodes?

Ans: Data distribution is based on the token range assigned to each node, so each node stores a different portion of the data.

2. How about data availability? What if one of the nodes is corrupted or a disk fails?

Ans: Because the data is replicated to other nodes, even if a node or disk fails the remaining nodes can serve the data. This is how high availability of the data is maintained.

3. If the data is not shared, how about read and write consistency?

Ans: Cassandra maintains eventual consistency by sending multiple read requests to the nodes when it sees that the data digests do not match, so it reads the most recent data from the other replica nodes.

Read the next post for the read/write paths.

Thanks

Geek DBA

Cassandra for Oracle DBA’s Part 3 – Architecture

Oracle (in fact any RDBMS) has memory structures and processes (or threads) to process the data.

Taking Oracle as the example, let's look at the similarities.

1. Memory Architecture

The mapping below pairs each Oracle memory structure with its Cassandra equivalent and the parameter to set (in cassandra.yaml unless noted otherwise).

Oracle SGA = Cassandra Java heap memory (MAX_HEAP_SIZE in cassandra-env.sh)

The SGA is the shared global area, whereas the Java heap memory is used by the Cassandra core engine.

Oracle buffer pool = Cassandra memtables (memtable_total_space_in_mb)

Memtables in Cassandra are kept per table to hold recently used or modified data, unlike one large buffer cache. If you run DML on 4 tables, 4 memtables are created in memory.

Oracle result cache = Cassandra row cache (row_cache_size_in_mb)

As with the result cache in Oracle, you can keep certain rows (entire rows) of a table in memory by using the row cache.

Oracle row cache (data dictionary cache) = Cassandra partition key cache (key_cache_size_in_mb)

When an operation has to work on a block, it must know the block's location: block number, disk number, node name and so on. To achieve this, Oracle relies on the data dictionary cache to get the block number and file number and then reads from disk. In Cassandra, data is stored on disk based on the partition key value, and this key is kept in the partition key cache to locate the data on disk efficiently.

Oracle hash buckets = Cassandra Bloom filter (bloom_filter_fp_chance)

To find a data block in the buffer cache, Oracle uses a mod function to find the right hash bucket, which keeps a process from traversing the entire chain of blocks. Similarly, Cassandra checks the Bloom filter to discover which SSTables are likely to have the requested partition data. The Bloom filter is stored in off-heap memory, and each SSTable has a Bloom filter associated with it. For example: CREATE TABLE ... WITH bloom_filter_fp_chance = 0.01;

N/A = Cassandra partition summary (min_index_interval 128 and max_index_interval 2048, or index_interval 128)

The partition summary is an off-heap in-memory structure that stores a sampling of the partition index. A partition index contains all partition keys, whereas a partition summary samples every X keys and maps the location of every Xth key in the index file. For example, if the partition summary is set to sample every 20 keys, it will store the location of the first key at the beginning of the SSTable file, the 20th key and its location in the file, and so on.

N/A = Cassandra compression offsets

The compression offset map holds the offset information for compressed blocks. By default all tables in Cassandra are compressed, and the higher the compression ratio, the larger the compression offset map.

2. Physical Structures

The mapping of Oracle physical structures to Cassandra is:

Oracle redo log = Cassandra commit log (commitlog_total_space_in_mb, commitlog_segment_size_in_mb)

To maintain the durability of a transaction, any database has to write the transaction off to a file: a redo log or transaction log, or the commit log in Cassandra's case. The purpose of the commit log is to be able to recreate the memtable after a node crashes or gets rebooted; this is important, since a memtable only gets flushed to disk when it is 'full'.

Oracle datafile = Cassandra SSTables (sstable_size_in_mb, sstable_min_threshold)

In Oracle a table is stored as segments, extents and blocks inside physical datafiles, contiguous or non-contiguous, within one or more large datafiles. SSTables work differently:

  • Each table has its own physical SSTables on disk
  • SSTables are immutable; immutable means they cannot be written to again, a kind of write-once-then-read-only (like temp)
  • So how does a write modify a table, aka its SSTables?
  • A memtable flush creates a new SSTable with the new data on the same disk, with an incremented file number
  • These SSTables (files) are merged together later, which is called compaction, when required or when the number of SSTables reaches a threshold
  • In a multi-node environment the SSTables are spread across nodes, with data distributed based on the token range of each node

An SSTable consists of different components:

  • Bloom filter
  • Index
  • Compression file
  • Statistics file
  • Data file
  • Summary file
  • TOC.txt

General Architecture

How Read Occurs in Cassandra?

  • Check the memtable
  • Check row cache, if enabled
  • Checks Bloom filter
  • Checks partition key cache, if enabled
  • Goes directly to the compression offset map if a partition key is found in the partition key cache, or checks the partition summary if not

    If the partition summary is checked, then the partition index is accessed

  • Locates the data on disk using the compression offset map
  • Fetches the data from the SSTable on disk

How Write Occurs in Cassandra

Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending in compaction:

  • Logging data in the commit log
  • Writing data to the memtable
  • Flushing data from the memtable
  • Storing data on disk in SSTables
  • Compaction, by merging the SSTables

3. Logical Structures

The mapping of Oracle logical structures to Cassandra, with the corresponding commands, is:

Oracle schema = Cassandra keyspace

CREATE KEYSPACE TEST
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : n };

A keyspace is a schema: a collection of tables, a top-level namespace. The replication factor determines the mirroring of its data in a multi-node environment.

Oracle table = Cassandra column family

CREATE TABLE TEST.Users(
  ID text,
  firstname text,
  lastname text,
  PRIMARY KEY (ID)
);

A table is often called a column family in NoSQL terms and is created with a primary key.

Oracle partition key = Cassandra partition key

CREATE TABLE TEST.Users(
  ID text,
  firstname text,
  lastname text,
  PRIMARY KEY (ID)
);

Here "primary key" means the partition key used to place the data, not a primary key constraint as in Oracle. As with Oracle's composite partitioning keys, you can put multiple columns into the primary key.

Oracle index = Cassandra index

CREATE INDEX user_state
   ON TEST.Users (lastname);

Creating a simple secondary index, which in turn creates files on disk.

Oracle sequence = Cassandra counter

CREATE TABLE TEST.page_view_counts
  (counter_value counter,
  url_name varchar,
  page_name varchar,
  PRIMARY KEY (url_name, page_name)
);

The counter datatype gives a sequence-like value that you simply increment in your DML and code, for example:

UPDATE TEST.page_view_counts SET counter_value = counter_value + 1 WHERE url_name='www.datastax.com' AND page_name='home';

Oracle trigger = Cassandra trigger

CREATE TRIGGER myTrigger
  ON TEST.myTable
  USING 'org.apache.cassandra.triggers.INDEXCHECK';

In an RDBMS, triggers are stored as compiled code in database memory (the shared pool / library cache). In the Cassandra implementation, to create a trigger you must first build a jar with a class implementing the ITrigger interface and put it into the triggers directory on every node, then perform a CQL3 CREATE TRIGGER request to tie your trigger to a Cassandra table (or several tables).

Oracle type = Cassandra type

CREATE TYPE TEST.address (
  street text,
  city text,
  zip_code int,
  phones set<text>
);

A user-defined type facilitates handling multiple fields of related information, such as address information: street, city and postal code.

Oracle undo = Cassandra tombstones (gc_grace_seconds)

Deletes do not remove the rows and columns immediately and in place. Instead, Cassandra writes a special marker, called a tombstone, indicating that a row, column, or range of columns was deleted. These tombstones are kept for at least the period of time defined by the per-table gc_grace_seconds setting.

Image Courtesy: Datastax

-Thanks

GeekDBA

Cassandra for Oracle DBA’s Part 2 – Three things you need to know

Cassandra has three important differences compared with an RDBMS:

  • NO-SQL
  • Data stored in a columnar structure
  • A distributed system with a shared-nothing architecture

Cassandra is a combination of distributed processing (for high availability and scalability), NO-SQL (to process unstructured data and a schema-less structure) and a column store (to handle a wide variety and velocity of data), an adoption of ideas from Google BigTable and Amazon DynamoDB.

Let's delve into each of these things.

1. NOSQL - Not Only SQL

NoSQL technologies address narrow yet important business needs. Most NoSQL vendors support structured, semi-structured, or non-structured data, which can be very useful indeed. The real value comes from the fact that NoSQL can ingest HUGE amounts of data, very fast. Forget gigabytes, and even terabytes, we are talking petabytes! Gobs and gobs of data! With clustering support and multi-threaded inner workings, scaling to the expected future explosion of data will seem a no-brainer with a NoSQL environment in play. Let's get excited, but temper it with the understanding that NoSQL is COMPLEMENTARY and not COMPETITIVE to row- and column-based databases. And also note that NoSQL is not a database as such but a high-performance distributed file system, really great at dealing with lots and lots of data.

There are three variants of NOSQL,

 

  • Key-value stores, which support fast transactional inserts (like an internet shopping cart); they generally store data in memory and are great for web applications that need a lot of in/out operations.
  • Document stores (e.g. MongoDB), which store highly unstructured data as named value pairs; great for web traffic analysis, detailed information, and applications that look at user behaviour, actions, and logs in real time.
  • Column stores (Cassandra), which focus on massive amounts of unstructured data across distributed systems (think Facebook and Google); great for shallow but wide data relationships, yet they fail miserably at ad-hoc queries.

2. Data Stored as Columnar Structure

Imagine a row structure versus a column structure. Creating a table in an RDBMS:

Create table emp
(id number primary key,
name varchar2(10),
age number,
interests varchar2(100));

Creating the equivalent table using CQL (Thrift naming convention: column family) in a column database like Cassandra:

Create table emp
(id int,
name text,
age int,
interests text,
PRIMARY KEY ((id), name, age, interests));

The nomenclature is as follows:

  • id: the partition key, which is responsible for data distribution across your nodes. Data is sorted by the clustering columns and stored per key value in separate files on disk.
  • name, age, interests: the column (clustering) key values stored within each partition.
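A quick sketch of what that means in practice (values are made up): all rows that share the same id land in one partition on one set of replicas, ordered by the clustering columns:

INSERT INTO emp (id, name, age, interests) VALUES (1, 'asha', 30, 'skiing');
INSERT INTO emp (id, name, age, interests) VALUES (1, 'ravi', 28, 'chess');
SELECT * FROM emp WHERE id = 1;   -- one partition, rows ordered by name, age, interests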

3. Distributed Systems with  Shared Nothing Architecture

  • For Cassandra the distributed system is a shared-nothing architecture, whereas the Oracle RAC architecture uses shared disk.
  • As you can see, the data is stored locally on each node.
  • And unlike RAC's master-slave aspects (at the cluster and database instance level), this is a peer-to-peer architecture where every node can act as coordinator.

  • Because the data is stored locally, it is evenly distributed (striped) across the nodes as it is written into tables.
  • Each node owns a token range and accepts and stores the data that falls into it.
  • The token range is determined by the partitioning scheme Cassandra uses: RandomPartitioner, Murmur3Partitioner, OrderPreservingPartitioner, etc.
  • If a node is added or deleted, then much like ASM diskgroup re-balancing, Cassandra re-balances node data to the other nodes by rearranging the tokens assigned to each node.

  • But because the data is stored locally, there is a likelihood of data loss in case of a failure
  • To avoid this, Cassandra applies a replication factor, defined per schema/keyspace, when inserting data
  • So all tables in that schema/keyspace have their data mirrored across nodes (like ASM disk mirroring)
  • This avoids a single point of failure as well as contention on read/write I/O

Cassandra mirrors data to nodes based on two things, the strategy and the replication factor:

    • SimpleStrategy (single data centre), or NetworkTopologyStrategy for multiple DCs or a single DC with specific identification of IPs and servers, so that data is mirrored across data centres.
    • Replication factor: how many copies of the data should be maintained in the cluster.

    In the diagram below, with a 5-node Cassandra cluster, we have created a keyspace called Messages with the Simple replication placement strategy and replication factor 1.

    Replication factor 1: the data is not mirrored, and you can see the IL and KY rows with their counties written to different nodes.

    If a node fails, the data on that node will not be available to clients, i.e. data loss/unavailability.

    Example of a cluster with a composite partition key

    Take another example: in the diagram below, with a 5-node Cassandra cluster, we have created a keyspace called Messages with the Simple replication placement strategy and replication factor 3.

    Replication factor 3: the data is mirrored to at least three nodes in the cluster, and you can see the IL and KY rows with their counties written to different nodes.

    If a node fails, the data on that node will not be available directly, but the coordinator will get the data from the other replicas.

    But in case of a data centre loss, it is likely that the whole cluster becomes unavailable.

     

    Take another example: in the diagram below, with a 5-node Cassandra cluster, we have created a keyspace called Messages with the NetworkTopologyStrategy and a replication factor of 3 for the us-west data centre and 2 for us-east.

    As you can see below, the data is spawned/mirrored as three copies in us-west and two copies in us-east.

    Multi-datacenter clusters for geographically separate data

     

    That's all for this post. There's nothing new in what I have written above; most of it is available on the internet, but to build a series of posts I'd like to have an introduction first and then dig deeper for DBAs.

    Let's take a closer look at the architecture of Cassandra and its components on each node in the next post.

    Image Courtesy: Random Blogs in internet & Apache Cassandra website.

    GeekDBA

     

    Cassandra for Oracle DBA’s Part 6 – Installing Cassandra

    Installing Cassandra is straightforward. CCM is a tool that simulates a Cassandra cluster on a single host, ideal for test environments or when you want to get hands-on with Cassandra.

    To install Cassandra at production scale, you can download Cassandra directly from apache.org, or from Datastax for the enterprise version.

    I cover here both of these methods.

    For CCM Installation,

    Step 1: Upgrade Python to 2.7 (this is a must for CCM-based installations; also use an alternate install when upgrading to Python 2.7, otherwise you will break libraries that depend on the system Python).

       cd /usr/src

       wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz

       tar xzf Python-2.7.10.tgz

      cd Python-2.7.10

      ./configure

      make altinstall

      ls -ltr /usr/bin/python*

      ls -ltr /usr/local/bin/python*

      ln -s /usr/local/bin/python2.7 /usr/local/bin/python

      wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py

      sudo /usr/local/bin/python2.7 ez_setup.py

      sudo /usr/local/bin/easy_install-2.7 pip

      sudo easy_install six

      sudo easy_install pyyaml

     

    Step 2: Download the CCM Package from Git hub

    git clone https://github.com/pcmanus/ccm.git

    or

    pip install ccm

    Change setup.py to point to /usr/local/bin/python instead of /usr/bin/env python (this is important; otherwise CCM will not work).

    cd ccm; sudo ./setup.py install; cd ..

    Step 3: Create a CCM cluster for the DSE stack (this command only downloads Cassandra into the directory you are running it from).

    ccm create --version 1.2.0 --nodes 3 --start test

    or 

    ccm create dse_cluster --dse --dse-username=emaildi --dse-password=password -v 4.5.2 -n 3 -s

    Step 4: Verify the installation

    root@wash-i-16ca26c8-prod ~/.ccm/repository/4.5.2/resources/cassandra/conf $ ccm node4 nodetool ring

    Note: Ownership information does not include topology; for complete information, specify a keyspace

    Datacenter: Cassandra

    ==============

    Address    Rack        Status State   Load            Owns                Token

    127.0.0.1  rack1       Up     Normal  3.05 MB         33.33%              -9223372036854775808

    127.0.0.2  rack1       Up     Normal  2.99 MB         33.33%              -3074457345618258603

     

    127.0.0.3  rack1       Up     Normal  3.5 MB          33.33%              3074457345618258602

     

    Installing Cassandra Cluster Manually on each of the node

    Step 1: Downloading and Installing Cassandra

     

    sudo su - cassandra

    wget http://mirrors.ae-online.de/apache/cassandra/2.1.3/apache-cassandra-2.1.3-bin.tar.gz

    sudo tar -xvf /home/cassandra/apache-cassandra-2.1.3-bin.tar.gz -C /usr/local/

    cd /usr/local/

    sudo mv apache-cassandra-2.1.3 cassandra-2.1.3

    sudo ln -s /usr/local/cassandra-2.1.3 /usr/local/cassandra

    sudo chown -R cassandra:cassandra /usr/local/cassandra-2.1.3

    export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

    export CASSANDRA_INSTALL=/usr/local/cassandra

    export CASSANDRA_HOME=$CASSANDRA_INSTALL

    Edit the $CASSANDRA_HOME/conf/cassandra.yaml file (vim $CASSANDRA_HOME/conf/cassandra.yaml) and replace the following keys with proper values:

    cluster_name: 'MyFirstCluster'

    seeds: "10.64.200.47, 10.64.200.48"

    listen_address: 10.64.200.47

    rpc_address: 10.64.200.47

    Note: the seeds list should contain the nodes acting as seeds; listen_address and rpc_address should be the IP of the node you are installing.

    Step 2: Start Cassandra on all nodes

    $CASSANDRA_HOME/bin/cassandra -f

    Step 3: Check status of cluster

    $CASSANDRA_HOME/bin/nodetool status

    -Thanks

    Geek DBA