WTF Is a SuperColumn? An Intro to the Cassandra Data Model (Part 3)

Reposted on htsjk.com, 2019-09-12 22:45

Translated by leo zheng

Original article: WTF is a SuperColumn? An Intro to the Cassandra Data Model, by Arin Sarkissian (Digg)

TaggedPosts ColumnFamily
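OK, this is where things get interesting. This ColumnFamily is going to do some heavy lifting for us. It not only stores the association between tags and blog entries, it also lets us fetch all the entries for a given tag in sorted order (remember the sorting rules we covered earlier?).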

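One design point I want to call out: in our system, every BlogEntry also gets a tag called "__notag__". Tagging every BlogEntry with "__notag__" lets us use this same ColumnFamily to store all entries in sorted order. It is a bit of a cheat, but it means a single ColumnFamily can serve both "get all recent entries" and "get all recent entries tagged 'foo'".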

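Under this data model, an entry with 3 tags has a Column in 4 rows: one row per tag, plus one in the "__notag__" row.
Since we want to display entries in reverse chronological order, we make sure each Column name is a time UUID and set the ColumnFamily's CompareWith attribute to TimeUUIDType. Columns are then stored sorted by time, so reading a row from its end yields the newest entries first :) This makes "get the 10 most recent entries tagged 'foo'" a very efficient operation.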
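To make the "#tags + 1 columns per entry" rule concrete, here is a minimal in-memory sketch in Python. It is not Cassandra client code: plain dicts stand in for rows, small integers stand in for time UUIDs, and the helper names `index_entry` and `latest` are hypothetical.

```python
from collections import defaultdict

NOTAG = "__notag__"  # the catch-all tag every entry receives


def index_entry(tagged_posts, slug, tags, time_key):
    """Write one column per tag, plus one in the __notag__ row.

    time_key stands in for a time UUID (an assumption made for
    readability); any value that sorts by creation time works here.
    """
    for tag in list(tags) + [NOTAG]:
        tagged_posts[tag][time_key] = slug


def latest(tagged_posts, tag, count=10):
    """Return up to `count` slugs from the row for `tag`, newest first."""
    row = tagged_posts.get(tag, {})
    newest_keys = sorted(row, reverse=True)[:count]
    return [row[key] for key in newest_keys]


# Row key -> {column name -> column value}
tagged_posts = defaultdict(dict)
index_entry(tagged_posts, "i-got-a-new-guitar", ["guitar", "music", "gear"], 1)
index_entry(tagged_posts, "scream-is-the-best-movie-ever", ["movie"], 2)

# Rows that hold a column for the three-tag entry: 3 tags + __notag__
rows_with_guitar_post = [
    tag for tag, row in tagged_posts.items()
    if "i-got-a-new-guitar" in row.values()
]
```

In this sketch `rows_with_guitar_post` ends up with four row keys, and `latest(tagged_posts, NOTAG)` walks the catch-all row from its newest column backwards.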

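        Now, when we want to show the 10 most recent entries (on the front page, for example), we just:
        1. Get the last 10 Columns in the row whose key is "__notag__" (the tag that stands for all entries)
        2. Loop over those Columns
        3. While looping, we know that each Column's value is the key of a row in the BlogEntries ColumnFamily
        4. We use that value to fetch the entry's row from the BlogEntries ColumnFamily, which gives us all of the data for the entry
        5. The BlogEntries row we just fetched has a Column named "author", whose value is a key into the Authors ColumnFamily; we use it to fetch the author's data
        6. At this point we have both the entry and the author information
        7. Next, we split the value of the "tags" Column to get the list of tags
        8. We now have everything we need to display the entry (no comments yet, since this isn't the permalink page)
        Following the steps above, we can fetch the entries for any tag, whether that means "all entries" or "entries tagged 'foo'". Kinda nice.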
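The steps above can be sketched as a small Python function. This is only an illustration under stated assumptions: plain dicts stand in for the three ColumnFamilies, integers stand in for time UUIDs, the sample rows and the `recent_entries` helper are hypothetical, and the "tags" Column is assumed to be comma-separated.

```python
NOTAG = "__notag__"

# Hypothetical sample data: row key -> {column name -> column value}
authors = {
    "some-author": {"name": "Some Author"},
}
blog_entries = {
    "i-got-a-new-guitar": {
        "title": "I got a new guitar", "author": "some-author", "tags": "guitar"},
    "another-cool-guitar": {
        "title": "Another cool guitar", "author": "some-author", "tags": "guitar,gear"},
    "scream-is-the-best-movie-ever": {
        "title": "Scream is the best movie ever", "author": "some-author", "tags": "movie"},
}
# Integer column names stand in for time UUIDs (an assumption).
tagged_posts = {
    NOTAG: {1: "i-got-a-new-guitar", 2: "another-cool-guitar",
            3: "scream-is-the-best-movie-ever"},
    "guitar": {1: "i-got-a-new-guitar", 2: "another-cool-guitar"},
    "gear": {2: "another-cool-guitar"},
    "movie": {3: "scream-is-the-best-movie-ever"},
}


def recent_entries(tag=NOTAG, count=10):
    """Return display data for the newest `count` entries under `tag`."""
    row = tagged_posts.get(tag, {})
    # Step 1: take the last `count` columns, newest first.
    newest_keys = sorted(row, reverse=True)[:count]
    results = []
    for key in newest_keys:                    # step 2
        slug = row[key]                        # step 3: value is a BlogEntries key
        entry = blog_entries[slug]             # step 4
        author = authors[entry["author"]]      # step 5
        tags = entry["tags"].split(",")        # step 7 (comma delimiter assumed)
        results.append(
            {"slug": slug, "entry": entry, "author": author, "tags": tags})
    return results


front_page = recent_entries()           # all entries, via __notag__
guitar_page = recent_entries("guitar")  # only entries tagged 'guitar'
```

The same function serves both the front page and a tag page; only the row key changes. Note that every entry costs two extra lookups (BlogEntries and Authors), which is the price of keeping TaggedPosts as a pure index.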

<!--
    ColumnFamily: TaggedPosts
    A secondary index used to determine which entries are associated with a tag

    Row Key => tag
    Column Names: a TimeUUIDType
    Column Value: row key into BlogEntries CF

    Access: get the entries tagged 'foo'

    We use this ColumnFamily to decide which entries to show on a tag page.
    We'll cheat a little and use __notag__ to mean "no tag restriction". Each entry is one column here, which means every entry gets #tags + 1 columns.


    TaggedPosts : { // CF
        // entries tagged "guitar"
        guitar : {  // the tag name is the row key
            // column names are TimeUUIDType, values are row keys into BlogEntries
            timeuuid_1 : i-got-a-new-guitar,
            timeuuid_2 : another-cool-guitar,
        },
        // all entries live here
        __notag__ : {
            timeuuid_1b : i-got-a-new-guitar,

            // this entry is also in the "guitar" row
            timeuuid_2b : another-cool-guitar,

            // this entry is also in the "movie" row
            timeuuid_3b : scream-is-the-best-movie-ever,
        },
        // blog entries tagged "movie"
        movie: {
            timeuuid_1c: scream-is-the-best-movie-ever
        }
    }
-->
<ColumnFamily CompareWith="TimeUUIDType" Name="TaggedPosts"/>

Comments ColumnFamily
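        Finally, we need to figure out how to model comments. This is where we get to use SuperColumns.
        We use one row per blog entry, and each row's key is the entry's slug. Within each row, one SuperColumn represents one comment. The SuperColumn's name is a UUID, and we will use TimeUUIDType. This ensures that all the comments for an entry are sorted in chronological order, oldest first. The Columns inside each SuperColumn hold the attributes of the comment (the commenter's name, the comment time, and so on).
        As you can see, this is really quite simple... nothing fancy.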
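As a rough illustration of this layout, the sketch below models the Comments ColumnFamily as nested Python dicts: row key (the entry slug), then SuperColumn name, then the comment's Columns. It is not Cassandra client code; integers stand in for time UUIDs, and `add_comment` and `comments_for` are hypothetical helper names.

```python
def add_comment(comments, slug, time_key, commenter, email, comment, comment_time):
    """Store one comment as a SuperColumn in the row for `slug`.

    time_key stands in for a time UUID (an assumption); the inner dict
    plays the role of the Columns inside the SuperColumn.
    """
    row = comments.setdefault(slug, {})
    row[time_key] = {
        "commenter": commenter,
        "email": email,
        "comment": comment,
        "commentTime": comment_time,
    }


def comments_for(comments, slug):
    """Return all comments for an entry, oldest first."""
    row = comments.get(slug, {})
    return [row[key] for key in sorted(row)]


comments = {}
# Inserted out of order on purpose: ordering comes from the SuperColumn name.
add_comment(comments, "scream-is-the-best-movie-ever", 2,
            "Some Dude", "sd@example.com",
            "be nice Joe Blow this isnt youtube", 1250557004)
add_comment(comments, "scream-is-the-best-movie-ever", 1,
            "Joe Blow", "joeb@example.com",
            "the godfather is the best movie ever", 1250438004)

thread = comments_for(comments, "scream-is-the-best-movie-ever")
```

Fetching a whole thread is a single row read in this model, which is the main reason for keying the row by the entry's slug.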

<!--
    ColumnFamily: Comments
    We store all comments here

    Row key => row key of the BlogEntry
    SuperColumn name: TimeUUIDType

    Access: get all comments for an entry

    Comments : {
        // comments for scream-is-the-best-movie-ever
        scream-is-the-best-movie-ever : { // row key = row key of BlogEntry
            // oldest comment first
            timeuuid_1 : { // SC Name
                // the Columns hold the attributes of the comment
                commenter: Joe Blow,
                email: joeb@example.com,
                comment: you're a dumb douche, the godfather is the best movie ever,
                commentTime: 1250438004
            },

            ... more comments for scream-is-the-best-movie-ever

            // newest comment last
            timeuuid_2 : {
                commenter: Some Dude,
                email: sd@example.com,
                comment: be nice Joe Blow this isnt youtube,
                commentTime: 1250557004
            },
        },

        // comments for i-got-a-new-guitar
        i-got-a-new-guitar : {
            timeuuid_1 : { // SC Name
                // the Columns hold the attributes of the comment
                commenter: Johnny Guitar,
                email: guitardude@example.com,
                comment: nice axe dawg...
                commentTime: 1250438004
            },
        }

        ..
        // SuperColumns for the other entries
    }
-->
<ColumnFamily CompareWith="TimeUUIDType" ColumnType="Super"
    CompareSubcolumnsWith="BytesType" Name="Comments"/>

Woot!

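That's it. Our little blog app is fully modeled and ready to go. It is quite a bit to digest, but in the end it all boils down to a short piece of configuration in storage-conf.xml: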

<Keyspace Name="BloggyAppy">
        <!-- other keyspace config stuff -->

        <!-- CF definitions -->
        <ColumnFamily CompareWith="BytesType" Name="Authors"/>
        <ColumnFamily CompareWith="BytesType" Name="BlogEntries"/>
        <ColumnFamily CompareWith="TimeUUIDType" Name="TaggedPosts"/>
        <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"
            CompareSubcolumnsWith="BytesType" ColumnType="Super"/>
</Keyspace>

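Now all you need to do is figure out how to get data into and out of Cassandra :) That is all done through the Thrift interface. The API wiki page for the Thrift interface does a decent job of explaining how to use it, so I won't go into the details here. In general, you just compile the cassandra.thrift file and use the generated code to call the API. You can also go through the Ruby client or the Python client instead.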

Alright... hopefully this was helpful, you now understand WTF a SuperColumn is, and you can start building some awesome apps.